Methods for genomic analysis

ABSTRACT

The present invention relates to methods for identifying variations that occur in the human genome and relating these variations to the genetic basis of disease and drug response. In particular, the present invention relates to identifying individual SNPs, determining SNP haplotype blocks and patterns, and, further, using the SNP haplotype blocks and patterns to dissect the genetic bases of disease and drug response. The methods of the present invention are useful in whole genome analysis.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to and is a continuation-in-part of U.S. provisional patent application Ser. No. 60/280,530, filed Mar. 30, 2001; U.S. provisional patent application Ser. No. 60/313,264, filed Aug. 17, 2001; U.S. provisional patent application Ser. No. 60/327,006, filed Oct. 5, 2001, all entitled “Identifying Human SNP Haplotypes; Informative SNPs and Uses Thereof”; and U.S. provisional patent application Ser. No. 60/332,550, filed Nov. 26, 2001, entitled “Methods for Genomic Analysis”; and the present application also claims priority to and is a continuation of U.S. patent application Ser. No. 10/106,097, filed Mar. 26, 2002, entitled “Methods for Genomic Analysis”, the disclosures of all of which are specifically incorporated herein by reference in their entireties for all purposes.

BACKGROUND OF THE INVENTION

The DNA that makes up human chromosomes provides the instructions that direct the production of all proteins in the body. These proteins carry out the vital functions of life. Variations in the sequence of DNA encoding a protein produce variations or mutations in the proteins encoded, thus affecting the normal function of cells. Although environment often plays a significant role in disease, variations or mutations in the DNA of an individual are directly related to almost all human diseases, including infectious disease, cancer, and autoimmune disorders. Moreover, knowledge of genetics, particularly human genetics, has led to the realization that many diseases result from either complex interactions of several genes or their products or from any number of mutations within one gene. For example, Type I and II diabetes have been linked to multiple genes, each with its own pattern of mutations. In contrast, cystic fibrosis can be caused by any one of over 300 different mutations in a single gene.

Additionally, knowledge of human genetics has led to a limited understanding of variations between individuals when it comes to drug response, the field of pharmacogenetics. Over half a century ago, adverse drug responses were correlated with amino acid variations in two drug-metabolizing enzymes, plasma cholinesterase and glucose-6-phosphate dehydrogenase. Since then, careful genetic analyses have linked sequence polymorphisms (variations) in over 35 drug metabolism enzymes, 25 drug targets and 5 drug transporters with compromised levels of drug efficacy or safety (Evans and Relling, Science 286:487-91 (1999)). In the clinic, such information is being used to prevent drug toxicity; for example, patients are screened routinely for genetic differences in the thiopurine methyltransferase gene that cause decreased metabolism of 6-mercaptopurine or azathioprine. Yet only a small percentage of observed drug toxicities have been explained adequately by the set of pharmacogenetic markers validated to date. Even more common than toxicity issues may be cases where drugs demonstrated to be safe and/or efficacious for some individuals have been found to have either insufficient therapeutic efficacy or unanticipated side effects in other individuals.

In addition to the importance of understanding the effects of variations in the genetic makeup of humans, understanding the effects of variation in the genetic makeup of non-human organisms (particularly pathogens) is important in understanding their effect on or interaction with humans. For example, the expression of virulence factors by pathogenic bacteria or viruses greatly affects the rate and severity of infection in humans that come into contact with such organisms. In addition, a detailed understanding of the genetic makeup of experimental animals (mice, rats, etc.) is also of great value. For example, understanding the variations in the genetic makeup of animals used as model systems for evaluation of therapeutics is important for understanding the test results obtained using these systems and their predictive value for human use.

Because any two humans are 99.9% similar in their genetic makeup, most of the sequence of the DNA of their genomes is identical. However, there are variations in DNA sequence between individuals. For example, there are deletions of many-base stretches of DNA, insertion of stretches of DNA, variations in the number of repetitive DNA elements in non-coding regions, and changes in single nitrogenous base positions in the genome called “single nucleotide polymorphisms” (SNPs). Human DNA sequence variation accounts for a large fraction of observed differences between individuals, including susceptibility to disease.

Although most SNPs are rare, it has been estimated that there are 5.3 million common SNPs, each with a frequency of 10-50%, that account for the bulk of the DNA sequence difference between humans. Such SNPs are present in the human genome once every 600 base pairs (Kruglyak and Nickerson, Nature Genet. 27:235 (2001)). Alleles (variants) making up blocks of such SNPs in close physical proximity are often correlated, resulting in reduced genetic variability and defining a limited number of “SNP haplotypes”, each of which reflects descent from a single, ancient ancestral chromosome (Fullerton, et al., Am. J. Hum. Genet. 67:881 (2000)).

The complexity of local haplotype structure in the human genome, and the distance over which individual haplotypes extend, is poorly defined. Empiric studies investigating different segments of the human genome in different populations have revealed tremendous variability in local haplotype structure. These studies indicate that the relative contributions of mutation, recombination, selection, population history, and stochastic events to haplotype structure vary in an unpredictable manner, resulting in some haplotypes that extend for only a few kilobases (kb), and others that extend for greater than 100 kb (A. G. Clark et al., Am. J. Hum. Genet. 63:595 (1998)).

These findings suggest that any comprehensive description of the haplotype structure of the human genome, defined by common SNPs, will require empirical analysis of a dense set of SNPs in many independent copies of the human genome. Such whole-genome analyses would provide a fine degree of genetic mapping and pinpoint specific regions of linkage. Until the present invention, however, the practice and cost of genotyping over 3,000,000 SNPs across each individual of a reasonably sized population has made this endeavor impractical. The present invention allows for, among a wide variety of applications, whole-genome association analysis of populations using SNP haplotypes.

SUMMARY OF THE INVENTION

The present invention relates to methods for identifying variations that occur in the human genome and relating these variations to the genetic bases of phenotype such as disease resistance, disease susceptibility or drug response. “Disease” includes but is not limited to any condition, trait or characteristic of an organism that it is desirable to change. For example, the condition may be physical, physiological or psychological and may be symptomatic or asymptomatic. The methods allow for identification of variants, identification of SNPs, determination of SNP haplotype blocks, determining SNP haplotype patterns, and further, identification of informative SNPs for each pattern, which affords genetic data compression.

Thus, one aspect of the present invention provides methods for selecting SNP haplotype patterns useful in data analysis. Such selection can be accomplished by isolating substantially identical (homologous) nucleic acid strands from a plurality of individuals; determining SNP locations in each nucleic acid strand; identifying the SNP locations in the nucleic acid strands that are linked, where the linked SNP locations form a SNP haplotype block; identifying isolate SNP haplotype blocks; identifying SNP haplotype patterns that occur in each SNP haplotype block; and selecting the identified SNP haplotype patterns that occur in at least two of the substantially identical nucleic acid strands. In one preferred embodiment, nucleic acid strands from at least about 10 different individuals or origins are used. In a more preferred embodiment, nucleic acid strands from at least 16 different origins are used. In an even more preferred embodiment, nucleic acid strands from at least 25 different origins are used, and in a yet more preferred embodiment, nucleic acid strands from at least 50 different origins are used. Further, a more preferred embodiment would determine SNP locations in at least about 100 nucleic acid strands from different origins. In addition, this method may further comprise selecting the SNP haplotype pattern that occurs most frequently in the substantially identical nucleic acid strands; selecting the SNP haplotype pattern that occurs next most frequently in the substantially identical nucleic acid strands; and repeating the selecting until the selected SNP haplotype patterns identify a portion of interest of the substantially identical nucleic acid strands. In a preferred embodiment, the portion of interest is between 70% and 99% of the substantially identical nucleic acid strands, and, in a more preferred embodiment, the portion of interest is about 80% of the substantially identical nucleic acid strands. Alternatively, one may wish to limit the selection of SNP haplotype patterns to no more than about three SNP haplotype patterns per SNP haplotype block.
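
The frequency-ordered selection described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention; the function name `select_patterns`, its parameters, and the example patterns are hypothetical, and each strand is assumed to have already been reduced to a pattern string for a single SNP haplotype block.

```python
from collections import Counter

def select_patterns(strand_patterns, coverage=0.80, max_patterns=None):
    """Pick patterns from most to least frequent until `coverage` is reached.

    strand_patterns: one pattern string per strand, all for the same block.
    coverage: fraction of strands the selected patterns should account for
              (0.80 mirrors the "about 80%" embodiment; an illustrative default).
    max_patterns: optional cap, e.g. 3 for "no more than about three" patterns.
    """
    counts = Counter(strand_patterns)
    total = len(strand_patterns)
    selected, covered = [], 0
    for pattern, n in counts.most_common():
        if n < 2:
            break  # keep only patterns seen in at least two strands
        if max_patterns is not None and len(selected) >= max_patterns:
            break
        selected.append(pattern)
        covered += n
        if covered / total >= coverage:
            break
    return selected, covered / total

# Hypothetical data: 20 strands typed at a four-SNP block.
patterns = ["ACGT"] * 9 + ["ATGT"] * 5 + ["ACGA"] * 4 + ["TTGA"] + ["ACTA"]
selected, fraction = select_patterns(patterns)
capped, capped_fraction = select_patterns(patterns, max_patterns=2)
```

With the cap set to two patterns the loop stops early, so the covered fraction can fall below the requested coverage; which of the two criteria should take precedence is a design choice the text leaves open.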

In addition, the present invention provides a method for selecting a data set of SNP haplotype blocks for data analysis, comprising comparing SNP haplotype blocks for informativeness; selecting a first SNP haplotype block with high informativeness; adding the first SNP haplotype block to the data set; selecting a second SNP haplotype block with high informativeness; adding the second selected SNP haplotype block to the data set; and repeating the selecting and adding steps until the region of interest of a DNA strand is covered. In preferred embodiments, the SNP haplotype blocks selected are non-overlapping.
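
As a rough illustration of this select-and-add loop, the following Python sketch greedily takes candidate blocks in order of decreasing informativeness and skips any block that overlaps one already chosen. The tuple layout, the function name `choose_blocks`, and the informativeness values are assumptions made for the example; how informativeness is actually scored is not specified in this passage.

```python
def choose_blocks(blocks, region_start, region_end):
    """Greedy, non-overlapping selection of SNP haplotype blocks.

    blocks: tuples (block_id, first_snp, last_snp, informativeness), with
            inclusive SNP indices lying inside the region of interest.
    Returns the chosen blocks ordered by position.
    """
    region_size = region_end - region_start + 1
    chosen, covered = [], 0
    for block in sorted(blocks, key=lambda b: b[3], reverse=True):
        _, start, end, _ = block
        # Accept the block only if it overlaps none of the chosen blocks.
        if all(end < s or start > e for _, s, e, _ in chosen):
            chosen.append(block)
            covered += end - start + 1
        if covered >= region_size:
            break  # region of interest is fully covered
    return sorted(chosen, key=lambda b: b[1])

# Hypothetical candidate blocks over SNP positions 1..10.
candidates = [("A", 1, 4, 3.0), ("B", 3, 6, 2.0), ("C", 5, 8, 2.5), ("D", 9, 10, 1.0)]
final_blocks = choose_blocks(candidates, 1, 10)
block_ids = [b[0] for b in final_blocks]
```

A greedy pass like this is simple but is not guaranteed to maximize total informativeness; it is shown only because it follows the stepwise wording of the paragraph.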

The present invention further provides methods for determining at least one informative SNP in a SNP haplotype pattern, comprising first determining SNP haplotype patterns for a SNP haplotype block, then comparing each SNP haplotype pattern of interest in the SNP haplotype block to the other SNP haplotype patterns of interest in the SNP haplotype block, and selecting at least one SNP in each SNP haplotype pattern that distinguishes this SNP haplotype pattern of interest from the other SNP haplotype patterns of interest in the SNP haplotype block. The selected SNP (or SNPs) is an informative SNP for the SNP haplotype pattern.
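
One simplified way to picture informative SNP selection is sketched below in Python. Rather than choosing SNPs pattern by pattern, this variant searches for the smallest set of SNP positions at which all patterns of interest in a block differ from one another. The function name `informative_snps` and the example patterns are hypothetical, and the brute-force search is only practical for small blocks.

```python
from itertools import combinations

def informative_snps(patterns):
    """Return the smallest tuple of SNP indices that tells all patterns apart.

    patterns: equal-length strings, one allele character per SNP location.
    Returns None if the patterns cannot be distinguished (e.g. duplicates).
    """
    n_snps = len(patterns[0])
    for k in range(1, n_snps + 1):
        for positions in combinations(range(n_snps), k):
            # Project every pattern onto the candidate SNP subset.
            reduced = {tuple(p[i] for i in positions) for p in patterns}
            if len(reduced) == len(patterns):
                return positions
    return None

# Hypothetical block with four SNP locations and three patterns of interest.
block_patterns = ["AGCT", "AGTT", "CGCA"]
chosen_positions = informative_snps(block_patterns)
```

Here no single SNP separates all three patterns, but the first and third SNPs together do, so typing two SNPs instead of four would identify the pattern in this toy block.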

Also, the present invention allows for rapid scanning of genomic regions and provides a method for determining disease-related genetic loci or pharmacogenomic-related loci without a priori knowledge of the sequence or location of the disease-related genetic loci or pharmacogenomic-related loci. This can be done by determining SNP haplotype patterns from individuals in a control population, then determining SNP haplotype patterns from individuals in an experimental population, such as individuals in a diseased population or individuals that react in a particular manner when administered a drug. The frequencies of the SNP haplotype patterns of the control population are compared to the frequencies of the SNP haplotype patterns of the experimental population. Differences in these frequencies indicate locations of disease-related genetic loci or pharmacogenomic-related loci.
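
The frequency comparison can be sketched in a few lines of Python. The helper names and the example data are hypothetical, and the sketch reports only raw frequency differences for one block; an actual study would be expected to apply an appropriate statistical test, which is outside the scope of this illustration.

```python
from collections import Counter

def pattern_frequencies(patterns):
    """Relative frequency of each SNP haplotype pattern in a population."""
    counts = Counter(patterns)
    total = len(patterns)
    return {p: n / total for p, n in counts.items()}

def frequency_differences(control, experimental):
    """Experimental minus control frequency for every observed pattern."""
    f_control = pattern_frequencies(control)
    f_exp = pattern_frequencies(experimental)
    all_patterns = set(f_control) | set(f_exp)
    return {p: f_exp.get(p, 0.0) - f_control.get(p, 0.0) for p in all_patterns}

# Hypothetical patterns observed at one haplotype block in each population.
control_pop = ["ACG"] * 6 + ["ATG"] * 4
experimental_pop = ["ACG"] * 3 + ["ATG"] * 7
diffs = frequency_differences(control_pop, experimental_pop)
```

In this toy data the "ATG" pattern is enriched in the experimental population, which is the kind of difference that would flag the block for follow-up.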

An additional aspect of the present invention provides a method of making associations between SNP haplotype patterns and a phenotypic trait of interest comprising: building a baseline of SNP haplotype patterns of control individuals by the methods of the present invention; pooling whole genomic DNA from a clinical population having a common phenotypic trait of interest; and identifying the SNP haplotype patterns that are associated with the phenotypic trait of interest. Thus, the present invention allows for genome scanning to identify multiple haplotype blocks associated with a phenotype, which is particularly useful when studying polygenic traits.

Also, the present invention provides a method for identifying drug discovery targets comprising: associating SNP haplotype patterns with a disease; identifying a chromosomal location of the associated SNP haplotype patterns; determining the nature of the association of the chromosomal location and said disease; and using the gene or gene product of the chromosomal location as a drug discovery target.

BRIEF DESCRIPTION OF THE FIGURES

The following figures and drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of the specific embodiments presented herein.

FIG. 1 is a schematic of one embodiment of the methods of the present invention from identifying variant locations to associating variants with phenotype, to using the associations to identify drug discovery targets or as diagnostic markers.

FIG. 2 shows sample SNP haplotype blocks and SNP haplotype patterns according to the present invention.

FIG. 3 is a schematic showing one embodiment of a method for selecting SNP haplotype blocks.

FIG. 4 illustrates a simple employment of one embodiment of the method shown in FIG. 3.

FIG. 5A is a schematic of one embodiment of a method for choosing a final set of SNP haplotype blocks. FIG. 5B is a simple employment of the method shown in FIG. 5A. The “letter:number” designations in FIG. 5B indicate “haplotype block ID:informativeness value” for each block.

FIG. 6 shows an example of how informative SNPs may be selected according to one embodiment of the present invention.

FIG. 7A is a schematic showing one embodiment for resolving variant ambiguities and/or SNP haplotype pattern ambiguities. FIG. 7B illustrates a simple employment of the method shown in FIG. 7A.

FIG. 8 is a schematic of one embodiment of using the methods of the present invention in an association study.

FIG. 9 shows an exemplary computer network system suitable for executing some embodiments of the present invention.

FIG. 10 is a schematic of the construction of somatic cell hybrids.

FIG. 11 is a table illustrating a portion of results obtained from screening hamster-human cell hybrids with the HuSNP genechip from Affymetrix, Inc.

FIG. 12 shows an example of various amplified genomic regions of human chromosome 22 and human chromosome 14 genomic DNA using long range PCR.

FIG. 13A is a bar graph showing the percentage of SNPs plotted against the frequency of the minor allele (variant) of the SNP. FIG. 13B is a graph of the percentage of 200 kb intervals as a function of the nucleotide diversity in the interval. FIG. 13C is a bar graph showing the percentage of all intervals plotted against interval length.

FIG. 14 shows the haplotype patterns for twenty independent globally diverse chromosomes defined by 147 common human chromosome 21 SNPs.

FIG. 15 is a plot of the fraction of chromosome covered as a function of the number of SNPs required for that coverage.

DETAILED DESCRIPTION OF THE INVENTION

It should be readily apparent to one skilled in the art that various embodiments and modifications may be made to the invention disclosed in this application without departing from the scope and spirit of the invention. All publications mentioned herein are cited for the purpose of describing and disclosing reagents, methodologies and concepts that may be used in connection with the present invention. Nothing herein is to be construed as an admission that these references are prior art in relation to the inventions described herein.

As used in the specification, “a” or “an” means one or more. As used in the claim(s), when used in conjunction with the word “comprising”, the words “a” or “an” mean one or more. As used herein, “another” means at least a second or more.

As used herein, when the term “different origins” is used, it refers to the fact that DNA strands from different organisms come from a different origin. Further, each DNA strand in a single organism's genome comes from a different origin. In a diploid organism, an individual organism's genome is made up of a set of pairs of substantially identical DNA strands. That is, a single individual would have substantially identical DNA strands from two different origins: one DNA strand of the pair is of maternal origin and one DNA strand of the pair is of paternal origin. Two or more nucleic acid sequences (for example, two or more DNA strands) are considered to be substantially identical if they exhibit at least about 70% sequence identity at the nucleotide level, preferably about 75%, more preferably about 80%, still more preferably about 85%, yet more preferably about 90%, even more preferably about 95%, and even more preferably at least about 98% sequence identity at the nucleotide level. The extent of sequence identity that is relevant between two or more nucleic acid sequences will depend on the host source of the nucleic acids. For example, a greater than 95% sequence identity may be relevant when looking at same species comparisons, whereas a sequence identity of 70% or even less may be relevant when making cross species comparisons. Of course, when one refers to DNA herein such reference may include derivatives of DNA such as amplicons, RNA transcripts, nucleic acid mimetics, etc.
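
For illustration only, percent sequence identity at the nucleotide level can be computed as in the Python sketch below. It assumes the two sequences have already been aligned to equal length, which the definition above does not specify; the function names and the 70% default threshold (taken from the broadest range stated above) are choices made for the example.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned positions at which two sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

def substantially_identical(seq_a, seq_b, threshold=70.0):
    """Apply an identity threshold; 70.0 is the broadest value in the definition."""
    return percent_identity(seq_a, seq_b) >= threshold

identity = percent_identity("ACGTACGTAC", "ACGTTCGTAC")  # one mismatch in ten bases
```

A same-species comparison might instead pass `threshold=95.0`, in line with the examples given in the definition.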

As used herein, “individual” refers to a specific single organism, such as a single animal, human, insect, bacterium, etc.

As used herein, “informativeness” of a SNP haplotype block is defined as the degree to which a SNP haplotype block provides information about genetic regions.

As used herein, the term “informative SNP” refers to a genetic variant such as a SNP or subset (more than one) of SNPs that tends to distinguish one SNP haplotype pattern from other SNP haplotype patterns within a SNP haplotype block.

As used herein, the term “isolate SNP block” refers to a SNP haplotype block that consists of one SNP.

As used herein, the term “linkage disequilibrium”, “linked” or “LD” refers to genetic loci that tend to be transmitted from generation to generation together, e.g., genetic loci that are inherited non-randomly.

As used herein, the term “singleton SNP haplotype” or “singleton SNP” refers to a specific SNP allele or variant that occurs in less than a certain portion of the population.

As used herein, the term “SNP” or “single nucleotide polymorphism” refers to a genetic variation between individuals; e.g., a single nitrogenous base position in the DNA of organisms that is variable. As used herein, “SNPs” is the plural of SNP. Of course, when one refers to DNA herein such reference may include derivatives of DNA such as amplicons, RNA transcripts, etc.

As used herein, the term “SNP haplotype block” means a group of variant or SNP locations that do not appear to recombine independently and that can be grouped together in blocks of variants or SNPs.

As used herein, the term “SNP haplotype pattern” refers to the set of genotypes for SNPs in a SNP haplotype block in a single DNA strand.

As used herein, the term “SNP location” is the site in a DNA sequence where a SNP occurs.

As used herein a “SNP haplotype sequence” is a DNA sequence in a DNA strand that contains at least one SNP location.

Preparation of Nucleic Acids for Analysis

Nucleic acid molecules may be prepared for analysis using any technique known to those skilled in the art. Preferably such techniques result in the production of a nucleic acid molecule sufficiently pure to determine the presence or absence of one or more variations at one or more locations in the nucleic acid molecule. Such techniques may be found, for example, in Sambrook, et al., Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory, New York) (1989), and Ausubel, et al., Current Protocols in Molecular Biology (John Wiley and Sons, New York) (1997), incorporated herein by reference.

When the nucleic acid of interest is present in a cell, it may be necessary to first prepare an extract of the cell and then perform further steps (e.g., differential precipitation, column chromatography, extraction with organic solvents and the like) in order to obtain a sufficiently pure preparation of nucleic acid. Extracts may be prepared using standard techniques in the art, for example, by chemical or mechanical lysis of the cell. Extracts then may be further treated, for example, by filtration and/or centrifugation and/or with chaotropic salts such as guanidinium isothiocyanate or urea or with organic solvents such as phenol and/or HCCl₃ to denature any contaminating and potentially interfering proteins. When chaotropic salts are used, it may be desirable to remove the salts from the nucleic acid-containing sample. This can be accomplished using standard techniques in the art such as precipitation, filtration, size exclusion chromatography and the like.

In some instances, it may be desirable to extract and separate messenger RNA from cells. Techniques and material for this purpose are known to those skilled in the art and may involve the use of oligo dT attached to a solid support such as a bead or plastic surface. Suitable conditions and materials are known to those skilled in the art and may be found in the Sambrook and Ausubel references cited above. It may be desirable to reverse transcribe the mRNA into cDNA using, for example, a reverse transcriptase enzyme. Suitable enzymes are commercially available from, for example, Invitrogen, Carlsbad, Calif. Optionally, cDNA prepared from mRNA may then be amplified.

One approach particularly suitable for examining haplotype patterns and blocks is using somatic cell genetics to separate chromosomes from a diploid state to a haploid state. In one embodiment, a human lymphoblastoid cell line that is diploid may be fused to a hamster fibroblast cell line that is also diploid such that the human chromosomes are introduced into the hamster cells to produce cell hybrids. The resulting cell hybrids are examined to determine which human chromosomes were transferred, and which, if any, of the transferred human chromosomes are in a haploid state (see, e.g., Patterson, et al., Annal. N.Y. Acad. Of Sciences, 396:69-81 (1982)).

A schematic of the procedure is shown in FIG. 10. FIG. 10 shows a diploid human lymphoblastoid cell line that is wildtype for the thymidine kinase gene being fused to a diploid hamster fibroblast cell line containing a mutation in the thymidine kinase gene. In a sub-population of the resulting cells, human chromosomes are present in hybrids. Selection for the human DNA-containing hybrid cells is achieved by utilizing HAT medium (selective medium). Only hybrid cells that have a stably-incorporated human DNA strand having the wildtype human thymidine kinase gene grow in cell culture medium containing HAT. Of the resulting hybrids, some hybrids may contain both copies of some human chromosomes, only one copy of a human chromosome or no copies of a particular human chromosome. For example, for a human chromosome 22 having a locus with either an A or a B allele, the resulting hybrid cells may contain one human chromosome 22 variant (e.g., the “A” variant) or a portion thereof, some may contain the other human chromosome 22 variant (the “B” variant) or a portion thereof, some may contain both human chromosome 22 variants or portions thereof, and some hybrids may not contain any portion of a human chromosome 22 at all. In FIG. 10, only two of the resulting hybrid populations are shown. Once the appropriate hybrids are selected, the nucleic acids from these hybrids may be isolated by, for example, the techniques described above and then subjected to SNP discovery, and haplotype block and pattern analyses of the present invention.

Amplification Techniques

It may be desirable to amplify one or more nucleic acids of interest before determining the presence or absence of one or more variations in the nucleic acid. Nucleic acid amplification increases the number of copies of the nucleic acid sequence of interest. Any amplification technique known to those of skill in the art may be used in conjunction with the present invention including, but not limited to, polymerase chain reaction (PCR) techniques. PCR may be carried out using materials and methods known to those of skill in the art.

PCR amplification generally involves the use of one strand of a nucleic acid sequence as a template for producing a large number of complements to that sequence. The template may be hybridized to a primer having a sequence complementary to a portion of the template sequence and contacted with a suitable reaction mixture including dNTPs and a polymerase enzyme. The primer is elongated by the polymerase enzyme producing a nucleic acid complementary to the original template.

For the amplification of both strands of a double stranded nucleic acid molecule, two primers may be used, each of which may have a sequence which is complementary to a portion of one of the nucleic acid strands. Elongation of the primers with a polymerase enzyme results in the production of two double-stranded nucleic acid molecules each of which contains a template strand and a newly synthesized complementary strand. The sequences of the primers typically are chosen such that extension of each of the primers results in elongation toward the site in the nucleic acid molecule where the other primer hybridizes.

The strands of the nucleic acid molecules are denatured (for example, by heating) and the process is repeated, this time with the newly synthesized strands of the preceding step serving as templates in the subsequent steps. A PCR amplification protocol may involve a few to many cycles of denaturation, hybridization and elongation reactions to produce sufficient amounts of the desired nucleic acid.
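
Because each cycle uses the products of the previous cycle as templates, the copy number grows roughly geometrically. The Python sketch below shows the standard idealized model; the function name and the `efficiency` parameter (the fraction of templates copied per cycle, 1.0 meaning perfect doubling) are illustrative assumptions rather than values taken from this specification.

```python
def copies_after_pcr(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a given number of PCR cycles.

    Each cycle multiplies the template count by (1 + efficiency);
    real reactions plateau as reagents are depleted, which this ignores.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles

ideal = copies_after_pcr(10, 30)          # perfect doubling every cycle
no_gain = copies_after_pcr(10, 30, 0.0)   # no amplification at all
```

Under perfect doubling, ten starting molecules would yield about 1.07×10¹⁰ copies after 30 cycles, which illustrates why a modest number of cycles can produce sufficient material.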

Although PCR methods typically employ heat to achieve strand denaturation and allow subsequent hybridization of the primers, any other means that results in making the nucleic acids available for hybridization to the primers may be used. Such techniques include, but are not limited to, physical, chemical, or enzymatic means, for example, by inclusion of a helicase (see Radding, Ann. Rev. Genetics 16: 405-436 (1982)) or by electrochemical means (see PCT Application Nos. WO 92/04470 and WO 95/25177).

Template-dependent extension of primers in PCR is catalyzed by a polymerase enzyme in the presence of at least 4 deoxyribonucleotide triphosphates (typically selected from dATP, dGTP, dCTP, dUTP and dTTP) in a reaction medium which comprises the appropriate salts, metal cations, and pH buffering system. Suitable polymerase enzymes are known to those of skill in the art and may be cloned or isolated from natural sources and may be native or mutated forms of the enzymes. So long as the enzymes retain the ability to extend the primers, they may be used in the amplification reactions of the present invention.

The nucleic acids used in the methods of the invention may be labeled to facilitate detection in subsequent steps. Labeling may be carried out during an amplification reaction by incorporating one or more labeled nucleotide triphosphates and/or one or more labeled primers into the amplified sequence. The nucleic acids may be labeled following amplification, for example, by covalent attachment of one or more detectable groups. Any detectable group known to those skilled in the art may be used, for example, fluorescent groups, ligands and/or radioactive groups. An example of a suitable labeling technique is to incorporate nucleotides containing labels into the nucleic acid of interest using a terminal deoxynucleotidyl transferase (TdT) enzyme. For example, a nucleotide (preferably a dideoxynucleotide) containing a label is incubated with the nucleic acid to be labeled and a sufficient amount of TdT to incorporate the nucleotide. A preferred nucleotide is a dideoxynucleotide (e.g., ddATP, ddGTP, ddCTP, ddTTP, etc.) having a biotin label attached.

Techniques to optimize the amplification of long sequences may be used. Such techniques work well on genomic sequences. The methods disclosed in pending US patent applications U.S. Ser. No. 60/317,311, filed Sep. 5, 2001; U.S. Ser. No. [unassigned], attorney docket number 1011N-1, filed Jan. 9, 2002, entitled “Algorithms for Selection of Primer Pairs”; and U.S. Ser. No. [assigned], attorney docket number 1011N1D1, filed Jan. 9, 2002, entitled “Methods for Amplification of Nucleic Acids” are particularly suitable for amplifying genomic DNA for use in the methods of the present invention.

Amplified sequences may be subjected to other post amplification treatments either before or after labeling. For example, in some cases, it may be desirable to fragment the amplified sequence prior to hybridization with an oligonucleotide array. Fragmentation of the nucleic acids generally may be carried out by physical, chemical or enzymatic methods that are known in the art. Suitable techniques include, but are not limited to, subjecting the amplified nucleic acids to shear forces by forcing the nucleic acid containing fluid sample through a narrow aperture or digesting the PCR product with a nuclease enzyme. One example of a suitable nuclease enzyme is DNase I. After amplification, the PCR product may be incubated in the presence of a nuclease for a period of time designed to produce appropriately sized fragments. The sizes of the fragments may be varied as desired, for example, by increasing the amount of nuclease or duration of incubation to produce smaller fragments or by decreasing the amount of nuclease or period of incubation to produce larger fragments. Adjusting the digestion conditions to produce fragments of the desired size is within the capabilities of a person of ordinary skill in the art. The fragments thus produced may be labeled as described above.

Methods for the Detection of SNPs (SNP Discovery)

Determination of the presence or absence of one or more variations in a nucleic acid may be made using any technique known to those of skill in the art. Any technique that permits the accurate determination of a variation can be used. Preferred techniques will permit rapid, accurate determination of multiple variations with a minimum of sample handling required. Some examples of suitable techniques are provided below.

Several methods for DNA sequencing are well known and generally available in the art and may be used to determine the location of SNPs in a genome. See, for example, Sambrook, et al., Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory, New York) (1989), and Ausubel, et al., Current Protocols in Molecular Biology (John Wiley and Sons, New York) (1997), incorporated herein by reference. Such methods may be used to determine the sequence of the same genomic regions from different DNA strands where the sequences are then compared and the differences (variations between the strands) are noted. DNA sequencing methods may employ such enzymes as the Klenow fragment of DNA polymerase I, Sequenase (US Biochemical Corp, Cleveland, Ohio), Taq polymerase (Perkin Elmer), thermostable T7 polymerase (Amersham, Chicago, Ill.), or combinations of polymerases and proofreading exonucleases such as those found in the Elongase Amplification System marketed by Gibco/BRL (Gaithersburg, Md.). Preferably, the process is automated with machines such as the Hamilton Micro Lab 2200 (Hamilton, Reno, Nev.), Peltier Thermal Cycler (PTC200; MJ Research, Watertown, Mass.) and the ABI Catalyst and 373 and 377 DNA Sequencers (Perkin Elmer, Wellesley, Mass.).
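
The comparison step, in which sequences of the same genomic region from different DNA strands are compared and the differences noted, can be sketched in Python as follows. The sketch assumes the sequences are already aligned and of equal length and that only single-base substitutions are of interest; the function name and example sequences are hypothetical.

```python
def variant_positions(sequences):
    """Map each variable position (0-based) to the sorted alleles seen there.

    sequences: aligned, equal-length sequences of the same genomic region,
               one per DNA strand.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    variants = {}
    for i in range(length):
        alleles = {s[i] for s in sequences}
        if len(alleles) > 1:  # more than one base observed: candidate SNP
            variants[i] = sorted(alleles)
    return variants

# Hypothetical reads of one region from four strands.
strands = ["ACGTAC", "ACGTAC", "ATGTAC", "ACGTGC"]
candidate_snps = variant_positions(strands)
```

Positions flagged this way are only candidates; sequencing errors would produce the same signal, so confirmation in additional strands would normally follow.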

In addition, capillary electrophoresis systems which are commercially available may be used to perform variation or SNP analysis. In particular, capillary sequencing may employ flowable polymers for electrophoretic separation, four different fluorescent dyes (one for each nucleotide) which are laser activated, and detection of the emitted wavelengths by a charge coupled device camera. Output/light intensity may be converted to electrical signal using appropriate software (e.g., Genotyper and Sequence Navigator, Perkin Elmer, Wellesley, Mass.) and the entire process from loading of samples to computer analysis and electronic data display may be computer controlled. Again, this method may be used to determine the sequence of the same genomic regions from different DNA strands where the sequences are then compared and the differences (variations between the strands) are noted.

Optionally, once a genomic sequence from one reference DNA strand has been determined by sequencing, it is possible to use hybridization techniques to determine variations in sequence between the reference strand and other DNA strands. These variations may be SNPs. An example of a suitable hybridization technique involves the use of DNA chips (oligonucleotide arrays), for example, those available from Affymetrix, Inc., Santa Clara, Calif. For details on the use of DNA chips for the detection of, for example, SNPs, see U.S. Pat. No. 6,300,063 issued to Lipshultz, et al., and U.S. Pat. No. 5,837,832 to Chee, et al., HuSNP Mapping Assay, reagent kit and user manual, Affymetrix Part No. 90094 (Affymetrix, Santa Clara, Calif.), all incorporated by reference herein.

In preferred embodiments, more than 10,000 bases of a reference sequence and the other DNA strands are scanned for variants. In more preferred embodiments, more than 1×10⁶ bases of a reference sequence and the other DNA strands are scanned for variants, even more preferably more than 2×10⁶ bases of a reference sequence and the other DNA strands are scanned, even more preferably 1×10⁷ bases are scanned, and more preferably more than 1×10⁸ bases are scanned, and more preferably more than 1×10⁹ bases of a reference sequence and the other DNA strands are scanned for variants. In preferred embodiments at least exons are scanned for variants, and in more preferred embodiments both introns and exons are scanned for variants. In an even more preferred embodiment, introns, exons and intergenic sequences are scanned for variants. In preferred embodiments the scanned nucleic acids are genomic DNA, including both coding and noncoding regions. In most preferred embodiments, such DNA is from a mammalian organism such as a human. In preferred embodiments, more than 10% of the genomic DNA from the organism is scanned, in more preferred embodiments more than 25% of the genomic DNA from the organism is scanned, in more preferred embodiments, more than 50% of the genomic DNA from the organism is scanned, and in most preferred embodiments, more than 75% of the genomic DNA is scanned. In some embodiments of the present invention, known repetitive regions of the genome are not scanned, and do not count toward the percentage of genomic DNA scanned. Such known repetitive regions may include Short Interspersed Nuclear Elements (SINEs, such as Alu and MIR sequences), Long Interspersed Nuclear Elements (LINEs, such as LINE1 and LINE2 sequences), Long Terminal Repeats (LTRs, such as MaLRs, Retrov and MER4 sequences), transposons, and MER1 and MER2 sequences.

Briefly, in one embodiment, labeled nucleic acids in a suitable solution are denatured (for example, by heating to 95° C.) and the solution containing the denatured nucleic acids is incubated with a DNA chip. After incubation, the solution is removed, the chip may be washed with a suitable washing solution to remove un-hybridized nucleic acids, and the presence of hybridized nucleic acids on the chip is detected. The stringency of the wash conditions may be adjusted as necessary to produce a stable signal. Detecting the hybridized nucleic acids may be done directly; for example, if the nucleic acids contain a fluorescent reporter group, fluorescence may be directly detected. If the label on the nucleic acids is not directly detectable, for example, biotin, then a solution containing a detectable label, for example, streptavidin coupled to phycoerythrin, may be added prior to detection. Other reagents designed to enhance the signal level may also be added prior to detection; for example, a biotinylated antibody specific for streptavidin may be used in conjunction with the biotin, streptavidin-phycoerythrin detection system. In some embodiments, the oligonucleotide arrays used in the methods of the present invention contain at least 1×10⁶ probes per array. In a preferred embodiment, the oligonucleotide arrays used in the methods of the present invention contain at least 10×10⁶ probes per array. In a more preferred embodiment, the oligonucleotide arrays used in the methods of the present invention contain at least 50×10⁶ probes per array.

Once variant locations have been determined (SNP discovery) by using, for example, sequencing or microarray analysis, it is necessary to genotype the SNPs of control and sample populations. The hybridization methods just described work well for this purpose, providing an accurate and rapid technique for detecting and genotyping SNPs in multiple samples. In addition, a technique suitable for the detection of SNPs in genomic DNA, without amplification, is the Invader technology available from Third Wave Technologies, Inc., Madison, Wis. Use of this technology to detect SNPs may be found, e.g., in Hessner, et al., Clinical Chemistry 46(8):1051-56 (2000); Hall, et al., PNAS 97(15):8272-77 (2000); Agarwal, et al., Diag. Molec. Path. 9(3):158-64 (2000); and Cooksey, et al., Antimicrobial Agents and Chemotherapy 44(5):1296-1301 (2000). In the Invader process, two short DNA probes hybridize to a target nucleic acid to form a structure recognized by a nuclease enzyme. For SNP analysis, two separate reactions are run, one for each SNP variant. If one of the probes is complementary to the sequence, the nuclease will cleave it to release a short DNA fragment termed a “flap”. The flap binds to a fluorescently-labeled probe and forms another structure recognized by a nuclease enzyme. When the enzyme cleaves the labeled probe, the probe emits a detectable fluorescence signal, thereby indicating which SNP variant is present.

An alternative to Invader technology, rolling circle amplification utilizes an oligonucleotide complementary to a circular DNA template to produce an amplified signal (see, for example, Lizardi, et al., Nature Genetics 19(3):225-32 (1998); and Zhong, et al., PNAS 98(7):3940-45 (2001)). Extension of the oligonucleotide results in the production of multiple copies of the circular template in a long concatemer. Typically, detectable labels are incorporated into the extended oligonucleotide during the extension reaction. The extension reaction can be allowed to proceed until a detectable amount of extension product is synthesized.

In order to detect SNPs using rolling circle amplification, three probes and two circular DNA templates may be used. The first probe, the target-specific probe, may be constructed to be complementary to a target nucleic acid molecule such that the 5′-terminus of the probe hybridizes to the nucleotide immediately adjacent 5′ to the SNP site in the target nucleic acid. The site of the SNP is not base paired to the first probe.

The other two probes, the rolling circle probes, are constructed to have two 3′-terminals. This can be accomplished in various ways, for example, by introducing a 5′-5′ linkage in the central portion of the probes resulting in a reversal of polarity of the nucleotide sequence at that point. One end of each of the probes has a sequence that is complementary to a portion of a different circular template molecule while the other end is complementary to a portion of the target nucleic acid sequence. The target-sequence-complementary terminal is constructed such that the 3′-most nucleotide aligns with the nucleotide at the SNP site. One of the probes may contain a nucleotide complementary to the nucleotide at the SNP site in the target nucleic acid while the other contains a nucleotide that is not complementary. In the instance where two or more variants of the SNP are present in the population, probes may be constructed to have 3′-nucleotides complementary to the variants to be detected.

The probes (both target specific and rolling circle) may be hybridized to the target sequence and contacted with a ligase enzyme. When the 3′-most nucleotide of the rolling circle probe forms a base pair with the nucleotide at the SNP site, the two probes (the target specific and the rolling circle) are efficiently ligated together. When the 3′-most nucleotide of the rolling circle probe is not capable of base pairing with the nucleotide at the SNP site in the target, the probes are not ligated. The unligated probe is washed away and the sample is contacted with the template circles, polymerase and labeled nucleoside triphosphates.

Another technique suitable for the detection of SNPs makes use of the 5′-exonuclease activity of a DNA polymerase to generate a signal by digesting a probe molecule to release a fluorescently labeled nucleotide. This assay is frequently referred to as a Taqman assay (see, e.g., Arnold, et al., BioTechniques 25(1):98-106 (1998); and Becker, et al., Hum. Gene Ther. 10:2559-66 (1999)). A target DNA containing a SNP is amplified in the presence of a probe molecule that hybridizes to the SNP site. The probe molecule contains both a fluorescent reporter-labeled nucleotide at the 5′-end and a quencher-labeled nucleotide at the 3′-end. The probe sequence is selected so that the nucleotide in the probe that aligns with the SNP site in the target DNA is as near as possible to the center of the probe to maximize the difference in melting temperature between the correct match probe and the mismatch probe. As the PCR reaction is conducted, the correct match probe hybridizes to the SNP site in the target DNA and is digested by the Taq polymerase used in the PCR assay. This digestion results in physically separating the fluorescent labeled nucleotide from the quencher with a concomitant increase in fluorescence. The mismatch probe does not remain hybridized during the elongation portion of the PCR reaction and is, therefore, not digested, and the fluorescently labeled nucleotide remains quenched.

Denaturing HPLC using a polystyrene-divinylbenzene reverse phase column and an ion-pairing mobile phase can be used to identify SNPs. A DNA segment containing a SNP is PCR amplified. After amplification, the PCR product is denatured by heating and mixed with a second denatured PCR product with a known nucleotide at the SNP position. The PCR products are annealed and are analyzed by HPLC at elevated temperature. The temperature is chosen to denature duplex molecules that are mismatched at the SNP location but not to denature those that are perfect matches. Under these conditions, heteroduplex molecules typically elute before homoduplex molecules. For an example of the use of this technique see Kota, et al., Genome 44(4):523-28 (2001).

SNPs can be detected using solid phase amplification and microsequencing of the amplification product. Beads to which primers have been covalently attached are used to carry out amplification reactions. The primers are designed to include a recognition site for a Type II restriction enzyme. After amplification, which results in a PCR product attached to the bead, the product is digested with the restriction enzyme. Cleavage of the product with the restriction enzyme results in the production of a single stranded portion including the SNP site and a 3′-OH that can be extended to fill in the single stranded portion. Inclusion of ddNTPs in an extension reaction allows direct sequencing of the product. For an example of the use of this technique to identify SNPs see Shapero, et al., Genome Research 11(11):1926-34 (2001).

Data Analysis

FIG. 1 is a schematic showing the steps of one embodiment of the methods of the present invention. Once SNPs (variants) have been located or discovered by, e.g., the methods described supra (step 110 of FIG. 1), SNP haplotype blocks, SNP haplotype patterns within each SNP haplotype block, and informative SNPs for the SNP haplotype patterns may be determined. One may use all SNPs or variants located; alternatively, one may focus the analysis on only a portion of the SNPs located. For example, the set of SNPs analyzed may exclude transition SNPs of the form Cg<−>Tg or cG<−>cA. In addition, in one embodiment of the present invention, the focus is on common SNPs. Common SNPs are those SNPs whose less common form is present at a minimum frequency in a given population. For example, common SNPs are those SNPs that are found in at least about 2% to 25% of the population. In a preferred embodiment, common SNPs are those SNPs that are found in at least about 5% to 15% of the population. In a more preferred embodiment, common SNPs are those that are found in at least about 10% of the population. Common SNPs likely result from mutations that occurred early in the evolution of humans. Focusing on common SNPs minimizes systematic allele or variant differences between control and experimental populations that appear as disease or drug-response associated, yet result only from migratory history or mating practices; i.e., focusing on common SNPs decreases the false positives that result from recent population anomalies. Moreover, common SNPs are relevant to a larger proportion of the human population, making the present invention more broadly applicable to disease and drug response studies. Along the same line, SNPs in which a variant is observed only once (singleton SNPs) may be eliminated from analysis in some embodiments of the present invention. However, certain analyses may be performed including some or all of these singleton SNPs, particularly when looking at specific sub-populations or populations that have been influenced by migratory practices and the like.
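The common-SNP focus described above can be sketched as a simple frequency filter. The following is a minimal illustration only: the function names, the dictionary layout (SNP identifier mapped to one observed allele per DNA strand), and the default 10% cutoff are assumptions chosen for the example, not part of the described method.

```python
from collections import Counter

def minor_allele_count(alleles):
    """Count of the second most common allele (0 if only one allele is seen)."""
    counts = Counter(alleles).most_common()
    return counts[1][1] if len(counts) > 1 else 0

def select_common_snps(genotypes, min_freq=0.10, drop_singletons=True):
    """Keep SNPs whose less common form reaches min_freq among the strands.

    genotypes: hypothetical layout, {snp_id: [allele observed on each strand]}.
    """
    kept = []
    for snp_id, alleles in genotypes.items():
        minor = minor_allele_count(alleles)
        # A singleton SNP is one whose variant form was observed only once.
        if drop_singletons and minor == 1:
            continue
        if minor / len(alleles) >= min_freq:
            kept.append(snp_id)
    return kept
```

Setting `drop_singletons=False` corresponds to the embodiments in which some or all singleton SNPs are retained.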

In step 120 of FIG. 1, the variants or SNPs of interest are assigned to haplotype blocks for evaluation. Variants or SNPs from a whole genome or chromosome may be analyzed and assigned to SNP haplotype blocks. Alternatively, variants from only a focused genomic region specific to some disease or drug response mechanism may be assigned to the SNP haplotype blocks.

FIG. 2 provides one illustration of how variants, usually SNPs, occur in haplotype blocks in a genome, and shows that more than one haplotype pattern can occur within each haplotype block. If SNP haplotype patterns were completely random, it would be expected that the number of possible SNP haplotype patterns observed for a SNP haplotype block of N SNPs would be 2^(N). However, it was observed in performing the methods of the present invention that the number of SNP haplotype patterns in each SNP haplotype block is smaller than 2^(N) because the SNPs are linked (not 4^(N), as the variants will most commonly be biallelic, i.e., occur in only one of two forms, not all four nucleotide base possibilities). Certain SNP haplotype patterns were observed at a much higher frequency than would be expected in a non-linkage case. Thus, SNP haplotype blocks are chromosomal regions that tend to be inherited as a unit, with a relatively small number of common patterns. Each line in FIG. 2 represents portions of the haploid genome sequence of different individuals. As shown therein, individual W has an “A” at position 241, a “G” at position 242, and an “A” at position 243. Individual X has the same bases at positions 241, 242, and 243. Conversely, individual Y has a T at positions 241 and 243, but an A at position 242. Individual Z has the same bases as individual Y at positions 241, 242, and 243. Variants in block 261 will tend to occur together. Similarly, the variants in block 262 will tend to occur together, as will those variants in block 263. Of course, only a few bases in a genome are shown in FIG. 2. In fact, most bases will be like those at positions 245 and 248, and will not vary from individual to individual.
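The gap between observed and possible patterns can be illustrated with the block 261 alleles given above (AGA for individuals W and X, TAT for individuals Y and Z). This is only a small sketch; the function name and the list-of-strings input are assumptions made for the example.

```python
def pattern_summary(haplotypes):
    """Return (patterns observed, patterns possible for biallelic SNPs)."""
    n_snps = len(haplotypes[0])
    observed = set(haplotypes)
    # With N biallelic SNPs, at most 2**N distinct patterns could occur.
    return len(observed), 2 ** n_snps

# Block 261 of FIG. 2 as described in the text: W, X, Y, Z.
block_261 = ["AGA", "AGA", "TAT", "TAT"]
observed, possible = pattern_summary(block_261)
print(observed, "patterns observed out of", possible, "possible")
```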

The assignment of SNPs to SNP haplotype blocks, step 120 of FIG. 1, is, in one case, an iterative process involving the construction of SNP haplotype blocks from the SNP locations along a genomic region of interest. In one embodiment, once the initial SNP haplotype blocks are constructed, SNP haplotype patterns present in the constructed SNP haplotype blocks are determined (step 130 of FIG. 1). In some specific embodiments, the number of SNP haplotype patterns selected per SNP haplotype block in step 130 is no greater than about five. In another specific embodiment, the number of SNP haplotype patterns selected per SNP haplotype block is equal to the number of SNP haplotype patterns necessary to identify SNP haplotype patterns in greater than 50% of the DNA strands being analyzed. In other words, enough SNP haplotype patterns are selected (for example, four patterns per block) such that at least half of the DNA strands analyzed will have a SNP haplotype pattern that matches one of the four patterns selected in each SNP haplotype block. In a preferred embodiment, the number of SNP haplotype patterns selected per SNP haplotype block is equal to the number of SNP haplotype patterns necessary to identify SNP haplotype patterns in greater than 70% of the DNA strands being analyzed. In one preferred embodiment, the number of SNP haplotype patterns selected per SNP haplotype block is equal to the number of SNP haplotype patterns necessary to identify SNP haplotype patterns in greater than 80% of the DNA strands being analyzed. In addition, in some embodiments of the present invention, SNP haplotype patterns that occur in less than a certain portion of DNA strands being analyzed are eliminated from analysis. For example, in one embodiment, if ten DNA strands are being analyzed, SNP haplotype patterns that are found to occur in only one sample out of ten are eliminated from analysis.
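One way to picture the pattern selection of step 130 is to take patterns in order of decreasing frequency until the chosen coverage of strands is exceeded, skipping patterns seen too rarely. The sketch below is illustrative only; the function name, the `coverage` and `min_count` parameters, and the most-frequent-first ordering are assumptions, since the text does not prescribe a particular procedure.

```python
from collections import Counter

def select_patterns(strands, coverage=0.80, min_count=2):
    """Pick the most frequent patterns until more than `coverage` of strands match.

    strands: one haplotype pattern string per DNA strand for a single block.
    min_count: patterns seen fewer times than this are dropped (assumed rule).
    """
    total = len(strands)
    selected, covered = [], 0
    for pattern, count in Counter(strands).most_common():
        if count < min_count or covered / total > coverage:
            break
        selected.append(pattern)
        covered += count
    return selected, covered / total
```

With ten strands and `min_count=2`, a pattern found in only one strand out of ten is eliminated, as in the example given above.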

Once the SNP haplotype patterns of interest are selected, informative SNPs for these SNP haplotype patterns are determined (step 140 of FIG. 1). From this initial set of blocks, a set of candidate SNP blocks that fit certain criteria for informativeness is constructed (step 150 of FIG. 1). FIGS. 4 and 5 illustrate steps 120, 130, 140 and 150 in more detail.

In FIG. 3, step 310 provides that a new block of SNPs is chosen for evaluation. In one embodiment, the first block chosen contains only the first SNP in a SNP haplotype sequence; thus at step 320, the first, single, SNP is added to the block. At step 330, informativeness of this block is determined.

“Informativeness” of a SNP haplotype block is defined in one embodiment as the degree to which the block provides information about genetic regions. For example, in one embodiment of the present invention, informativeness could be calculated as the ratio of the number of SNP locations in a SNP haplotype block divided by the number of SNPs required to distinguish each SNP haplotype pattern under consideration from other SNP haplotype patterns under consideration (number of informative SNPs) in that block. Another measure of informativeness might be the number of informative SNPs in the block. One skilled in the art recognizes that informativeness may be determined in any number of ways.

Referring again to FIG. 2, SNP haplotype block 261 contains three SNPs and two SNP haplotype patterns (AGA and TAT). Any one of the three SNPs present can be used to tell the patterns apart; thus, any one of these SNPs can be chosen to be the informative SNP for this SNP haplotype pattern. For example, if it is determined that a sample nucleic acid contains a T at the first position, the same sample will contain an A at the second position and a T at the third position. If it is determined in a second sample that the SNP in the second position is a G, the first and third SNPs will be A's. Thus, by one measure of informativeness, the informativeness value for this first block is 3 (3 total SNPs divided by 1 informative SNP needed to distinguish the patterns from each other). Similarly, SNP haplotype block 262 contains three SNPs (two positions do not have variants) and two haplotype patterns (TCG and CAC). As with the previously-analyzed block, any one of the three SNPs can be evaluated to tell one pattern from the other; thus, the informativeness of this block is 3 (3 total SNPs divided by 1 informative SNP needed to distinguish the patterns). SNP haplotype block 263 contains five SNPs and two SNP patterns (TAACG and ATCAC). Again, any one of the five SNPs can be used to tell one pattern from the other; thus, the informativeness of this block is 5 (5 total SNPs divided by 1 informative SNP needed to distinguish the patterns).

FIG. 2 provides a simple example of genetic analysis. When several SNP haplotype patterns are present in a block, it may be necessary to use more than one SNP as informative SNPs. For example, in a case where a block contains six SNPs and two SNPs are needed to distinguish the patterns of interest, the informativeness of the block is 3 (6 total SNPs divided by 2 SNPs needed to distinguish the patterns). Generally speaking, as many as 2^(N) distinct SNP haplotype patterns can be distinguished by using the genotypes of N suitably selected SNPs. Therefore, if there exist only two SNP haplotype patterns in the SNP haplotype block, a single SNP should be able to differentiate between the two. If there are three or four patterns, at least two SNPs would likely be required, etc.
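The ratio measure of informativeness can be sketched as follows. This is a minimal illustration that finds a smallest distinguishing set of SNP positions by exhaustive search, which is practical only for small blocks; the function names are hypothetical, the patterns are assumed to be distinct strings of equal length, and the text does not specify how informative SNPs are actually chosen.

```python
from itertools import combinations

def informative_snps(patterns):
    """Smallest set of SNP positions whose alleles tell all patterns apart."""
    n_snps = len(patterns[0])
    for size in range(1, n_snps + 1):
        for positions in combinations(range(n_snps), size):
            signatures = {tuple(p[i] for i in positions) for p in patterns}
            if len(signatures) == len(patterns):
                return positions
    return tuple(range(n_snps))  # reached only if patterns are not distinct

def informativeness(patterns):
    """Total SNPs in the block divided by the number of informative SNPs."""
    return len(patterns[0]) / len(informative_snps(patterns))

print(informativeness(["AGA", "TAT"]))      # block 261 of FIG. 2
print(informativeness(["TAACG", "ATCAC"]))  # block 263 of FIG. 2
```

For a hypothetical six-SNP block with three patterns such as AAAAAA, AATTTT and TTAAAA, no single position separates all three, so two informative SNPs are needed and the ratio is 6 divided by 2.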

In step 340 of FIG. 3, once the informativeness of a SNP haplotype block is determined, a test is performed. The test essentially evaluates the SNP haplotype blocks based on selected criteria (for example, whether a block meets a threshold measure of informativeness), and the result of the test determines whether, for example, another SNP will be added to the block for analysis or whether the analysis will proceed with a new block starting at a different SNP location. FIG. 4 illustrates one embodiment of this process.

In FIG. 4, assume there is a DNA sequence with six SNP locations. The analysis of SNP haplotype blocks described above might be performed in the following manner: SNP haplotype block A is selected containing only the SNP at SNP position 1 (steps 310 and 320 of FIG. 3). The informativeness of this block is calculated (step 330), and it is determined whether the informativeness of this block meets a threshold measure of informativeness (step 340). In this case, it “passes” and two things happen. First, this block of one SNP (SNP position 1) is added to the set of candidate SNP haplotype blocks (step 350). Second, another SNP (here, SNP position 2) is added to this block (step 320) to create a new block, B, containing SNP positions 1 and 2, which is then analyzed. In this illustration block B also meets the threshold measure of informativeness (step 340), so it would be added to the set of candidate SNP haplotype blocks (step 350), and another SNP (here, SNP position 3) is added to this block (step 320) to create new block C, containing SNP positions 1, 2 and 3, which is then analyzed. In this illustration, C also meets the threshold measure of informativeness and it is added to the set of candidate SNP haplotype blocks (step 350), and another SNP (here, SNP position 4) is added to this block (step 320) to create new block D, containing SNP positions 1, 2, 3, and 4, which is then analyzed. In the FIG. 4 illustration, SNP block D does not meet the threshold measure of informativeness. SNP block D is not added to the set of candidate SNP haplotype blocks (step 350), nor does another SNP get added to block D for analysis. Instead, a new SNP location is selected for a round of SNP block evaluations.

In FIG. 4, after block D fails to meet the threshold measure of informativeness, a new block, E, is selected that contains only the SNP at position 2. Block E is evaluated for informativeness, is found to meet the threshold measure, is added to the set of candidate SNP haplotype blocks (step 350), and another SNP (here, SNP position 3) is added to this block (step 320) to create new block F, containing SNP positions 2 and 3, which is then analyzed, and so on. Note that block H fails to meet the threshold measure of informativeness, is not added to the set of candidate SNP haplotype blocks (step 350), nor does another SNP get added to block H for analysis. Instead, a new block, I, is selected that contains only the SNP at position 3, and so on.
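The loop of FIGS. 3 and 4 can be sketched as a pair of nested iterations: start a block at each SNP position, extend it one SNP at a time, and stop extending at the first block that fails the threshold test. The sketch below is an assumption-laden illustration; `block_score` stands in for whatever informativeness measure is chosen, positions are 0-based, and all names are hypothetical.

```python
def build_candidate_blocks(n_snps, block_score, threshold):
    """Return candidate blocks as (start, end) pairs of inclusive SNP indices.

    block_score(start, end) is a caller-supplied measure for the block
    spanning SNP indices start..end (hypothetical interface).
    """
    candidates = []
    for start in range(n_snps):              # steps 310/320: new block, first SNP
        for end in range(start, n_snps):     # step 320: add the next SNP
            if block_score(start, end) < threshold:   # steps 330/340: test
                break                        # failed: move to a new start position
            candidates.append((start, end))  # step 350: keep as a candidate
    return candidates

# Toy score reproducing the FIG. 4 walk-through: blocks of up to three SNPs pass.
toy_score = lambda start, end: 1.0 if end - start + 1 <= 3 else 0.0
blocks = build_candidate_blocks(6, toy_score, threshold=1.0)
print(blocks[:6])
```

In the toy run, the blocks starting at index 0 correspond to blocks A, B and C of the text, and the four-SNP block corresponding to D is rejected.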

Once a set of candidate SNP blocks is constructed (step 350 of FIG. 3), analysis is performed on the set to select a final set of SNP blocks (step 160 of FIG. 1). The selection of the final set of SNP blocks can be performed in a variety of ways. For example, referring back to FIG. 4, one could select the largest block containing SNP position 1 that passes the threshold test (block C, containing SNPs 1, 2 and 3), and discard the smaller blocks that contain the same SNPs (blocks A and B). Then the next block selected might be the next block starting with SNP position 4 that is the largest block that meets the threshold test for informativeness (block G) and the smaller blocks that contain the same SNPs (blocks E and F) would be discarded. Such a method would give a set of final, non-overlapping SNP haplotype blocks that span the genomic region of interest, contain the SNPs of interest and that have a high level of informativeness. Thus, once all candidate SNP haplotype blocks are evaluated, the result may be, in a preferred embodiment, a set of non-overlapping SNP haplotype blocks that encompasses all the SNPs in the original set. Some groups, called isolates, may consist of only a single SNP, and by definition have an informativeness of 1. Other groups may consist of a hundred or more SNPs, and have an informativeness exceeding 30.
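The largest-block-first selection just described can be sketched as a left-to-right tiling: take the longest candidate starting at the current SNP, then continue from the first SNP after it. This is one possible reading only; the function name, the 0-based inclusive `(start, end)` representation, and the treatment of any SNP with no candidate block as a single-SNP isolate are assumptions for the example.

```python
def tile_largest_blocks(candidates, n_snps):
    """Cover SNP indices 0..n_snps-1 with non-overlapping blocks, left to right."""
    longest_end = {}
    for start, end in candidates:
        longest_end[start] = max(longest_end.get(start, start), end)
    final, position = [], 0
    while position < n_snps:
        # A position with no candidate block is kept as an isolate (assumption).
        end = longest_end.get(position, position)
        final.append((position, end))
        position = end + 1
    return final
```

For six SNPs where every block of up to three SNPs is a candidate, this yields a block over the first three SNPs followed by a block starting at the fourth SNP, in line with the example above.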

An alternative method for selecting a final set of SNP haplotype blocks is shown in FIGS. 5A and 5B. Looking first at FIG. 5A, in a first step 510, the candidate SNP haplotype block set (generated, for example, by the methods described in FIGS. 3 and 4 herein) is analyzed for informativeness. In step 520, the candidate SNP haplotype block with the highest informativeness in the entire candidate set is chosen to be added to the final SNP haplotype block set (step 530). Once this candidate SNP haplotype block is chosen to be a member of the final SNP haplotype block set, it is deleted from the candidate block set (step 540), and all other candidate SNP haplotype blocks that overlap with the chosen block are deleted from the candidate SNP haplotype block set (step 550). Next, the candidate SNP haplotype blocks remaining in the candidate set are analyzed for informativeness (step 510), and the candidate SNP haplotype block with the highest informativeness is chosen to be added to the final SNP haplotype block set (steps 520 and 530). As before, once this SNP haplotype block is chosen to be a member of the final SNP haplotype block set, it is deleted from the candidate block set (step 540), and all other candidate SNP haplotype blocks that overlap with the chosen block are deleted from the candidate SNP haplotype block set (step 550). The process continues until a final set of non-overlapping SNP haplotype blocks that encompasses all the SNPs in the original set is constructed.

FIG. 5B illustrates a simple employment of the method of selecting a final set of SNP haplotype blocks described in FIG. 5A. In FIG. 5B, a sequence 5′ to 3′ is analyzed for SNPs, SNP haplotype patterns and candidate SNP haplotype blocks according to the methods of the present invention. Candidate SNP haplotype blocks contained within this sequence are indicated by their placement under the sequence, and are designated by a letter. In addition, after the letter, the informativeness of each block is indicated. For example, candidate SNP haplotype block A is located at the extreme 5′ end of the sequence, and has an informativeness of 1. Candidate SNP haplotype block R is located at the extreme 3′ end of the sequence, and has an informativeness of 2.

According to FIG. 5A, in a first step 510, the candidate SNP haplotype blocks are analyzed for informativeness, and in step 520, the SNP haplotype block with the highest informativeness is chosen to be added to the final SNP haplotype block set (steps 520 and 530). In the case of FIG. 5B, candidate SNP haplotype block M with an informativeness of 6 would be the first candidate SNP haplotype block selected to be added to the final SNP haplotype block set. Once SNP haplotype block M is selected, it is deleted or removed from the candidate set of SNP haplotype blocks (step 540), and all other candidate SNP haplotype blocks that overlap with SNP haplotype block M (blocks J, N, K, L, O and P) are deleted from the candidate SNP haplotype block set (step 550). Next, the remaining blocks of the candidate SNP haplotype block set, namely SNP haplotype blocks A, B, C, D, E, F, G, H, I, Q and R, are analyzed for informativeness, and in step 520, the remaining SNP haplotype block with the highest informativeness, I, with an informativeness of 5, is chosen to be added to the final SNP haplotype block set (530) and deleted or removed from the candidate set of SNP haplotype blocks (step 540). Next, in step 550, all other candidate SNP haplotype blocks that overlap with SNP haplotype block I (here, only block H) are deleted from the candidate SNP haplotype block set. Again, the remaining blocks of the candidate SNP haplotype block set, namely SNP haplotype blocks A, B, C, D, E, F, G, Q and R, are analyzed for informativeness. In step 520, the remaining SNP haplotype block with the highest informativeness, block F, with an informativeness of 4, is chosen to be added to the final SNP haplotype block set (530) and deleted or removed from the candidate set of SNP haplotype blocks (step 540). Next, all other candidate SNP haplotype blocks that overlap with SNP haplotype block F (here, blocks E, G, C and D) are deleted from the candidate SNP haplotype block set, and the remaining blocks of the candidate SNP haplotype block set, namely SNP haplotype blocks A, B, Q and R, are analyzed for informativeness, and so on.
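The selection loop of FIG. 5A can be sketched in a few lines. The example below is illustrative only: the tuple layout `(start, end, informativeness)`, the 0-based inclusive indices, the tie-breaking rule (first encountered wins), and the sample values are assumptions and do not reproduce the blocks of FIG. 5B.

```python
def select_final_blocks(candidates):
    """Repeatedly keep the most informative block and drop all that overlap it.

    candidates: list of (start, end, informativeness), inclusive SNP indices.
    """
    remaining = list(candidates)
    final = []
    while remaining:
        best = max(remaining, key=lambda block: block[2])   # steps 510/520
        final.append(best)                                  # step 530
        # Steps 540/550: remove the chosen block and every block overlapping it.
        remaining = [b for b in remaining if b[1] < best[0] or b[0] > best[1]]
    return sorted(final)

example = [(0, 0, 1), (0, 1, 2), (1, 3, 3), (2, 5, 6), (4, 6, 4), (6, 7, 2), (7, 7, 1)]
print(select_final_blocks(example))
```

In this sketch the loop ends when the candidate set is empty; whether every SNP ends up covered depends on the candidate set supplied.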

Other methods can be employed to select a final set of SNP haplotype blocks for analysis from the set of candidate SNP haplotype blocks (step 160 of FIG. 1). For example, algorithms known in the art may be applied for this purpose. For example, shortest-paths algorithms may be used (see, generally, Cormen, Leiserson, and Rivest, Introduction to Algorithms (MIT Press) pp. 514-78 (1994)). In a shortest-paths problem, a weighted, directed graph G=(V,E), with weight function w:E→R mapping edges to real-valued weights, is given. The weight of path p=(v₀, v₁, . . . , v_(k)) is the sum of the weights of its constituent edges:

${w(p)} = {\sum\limits_{i = 1}^{k}{{w\left( {v_{i - 1},v_{i}} \right)}.}}$

The shortest-path weight from u to v is defined by δ(u,v) being equal to min{w(p): p is a path from u to v} if there is a path from u to v; otherwise, δ(u,v) is equal to infinity. A shortest path from vertex u to vertex v is then defined as any path p with weight w(p)=δ(u,v). Edge weights can be interpreted as various metrics: for example, distance, time, cost, penalties, loss, or any other quantity that accumulates linearly along a path and that one wishes to minimize. In the embodiment of the shortest path algorithm used in applications of this invention, each SNP haplotype block would be considered a “vertex” with an “edge” defined for each boundary of the block. Each SNP haplotype block has a relationship to each other SNP haplotype block, with a “cost” for each edge. Cost is determined by parameters of choice, such as overlap (or the extent thereof) of the vertices or gaps between the vertices.
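For reference, the path weight and shortest-path weight defined above can be written together in display form. This is only a notational restatement of the prose definitions; it assumes the amsmath package for the `cases` environment and `\text`.

```latex
w(p) = \sum_{i=1}^{k} w(v_{i-1}, v_{i}), \qquad
\delta(u,v) =
\begin{cases}
\min\{\, w(p) : p \text{ is a path from } u \text{ to } v \,\} & \text{if a path from } u \text{ to } v \text{ exists},\\
\infty & \text{otherwise.}
\end{cases}
```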

Single-source shortest-paths problems focus on a given graph G=(V,E), where a shortest path from a given source vertex s ∈ V to every vertex v ∈ V is determined. Additionally, variants of the single-source algorithm may be applied. For example, one may apply a single-destination shortest-paths solution where a shortest path to a given destination vertex t from every vertex v is found. Reversing the direction of each edge in the graph reduces this problem to a single-source problem. Alternatively, one may apply a single-pair shortest-path problem where the shortest path from u to v for given vertices u and v is found. If the single-source problem with source vertex u is solved, the single-pair shortest-path problem is solved as well. Also, the all-pairs shortest-paths approach may be employed. In this case, a shortest path from u to v for every pair of vertices u and v is found, for example by running a single-source algorithm from each vertex.

One single-source shortest-path algorithm that may be employed in the methods of the present invention is Dijkstra's algorithm. Dijkstra's algorithm solves the single-source shortest-paths problem on a weighted, directed graph G=(V,E) for the case in which all edge weights are nonnegative. Dijkstra's algorithm maintains a set of vertices, S, whose final shortest-path weights from a source s have already been determined. That is, for all vertices v ∈ S, d[v]=δ(s,v). The algorithm repeatedly selects the vertex u ∈ V−S with the minimum shortest-path estimate, inserts u into S, and relaxes all edges leaving u. In one implementation, a priority queue Q that contains all the vertices in V−S, keyed by their d values, is maintained. This implementation assumes that graph G is represented by adjacency lists.

DIJKSTRA (G, w, s)
    INITIALIZE-SINGLE-SOURCE (G, s)
    S←Ø
    Q←V[G]
    while Q≠Ø
        do u←EXTRACT-MIN (Q)
           S←S ∪ {u}
           for each vertex v ∈ Adj[u]
               do RELAX (u, v, w)

Thus, G in this case is the graph of linear coverage of the genomic sequence being analyzed and S is the set of vertices selected. Once one vertex is selected that covers a particular area of the genomic sequence, other vertices that overlap this sequence can be discarded.
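The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the invention: the function name `dijkstra`, the dictionary-based adjacency lists, and the example graph and its weights are all assumptions made for the example. It also uses lazy insertion into a binary heap in place of the initial Q←V[G] and a decrease-key operation, which yields the same shortest-path weights.

```python
import heapq


def dijkstra(adj, source):
    """Shortest-path weights from `source`.

    `adj` maps every vertex to a list of (neighbor, weight) pairs,
    i.e. an adjacency-list representation with nonnegative weights.
    """
    dist = {v: float("inf") for v in adj}   # INITIALIZE-SINGLE-SOURCE
    dist[source] = 0.0
    finished = set()                         # the set S
    queue = [(0.0, source)]                  # priority queue keyed by d
    while queue:
        d, u = heapq.heappop(queue)          # EXTRACT-MIN
        if u in finished:
            continue                         # stale entry; u is already in S
        finished.add(u)
        for v, weight in adj[u]:
            if weight < 0:
                raise ValueError("edge weights must be nonnegative")
            if d + weight < dist[v]:         # RELAX(u, v, w)
                dist[v] = d + weight
                heapq.heappush(queue, (dist[v], v))
    return dist


# Hypothetical graph: vertices stand in for candidate blocks,
# weights for an illustrative overlap/gap cost.
example_graph = {
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("t", 6)],
    "b": [("t", 2)],
    "t": [],
}
example_dist = dijkstra(example_graph, "s")
```

In this hypothetical graph the cheapest route from s to t passes through a and b, with total weight 5.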

Other algorithms that may be used for selecting SNP haplotype blocks include a greedy algorithm (again, see Cormen, Leiserson, and Rivest, Introduction to Algorithms (MIT Press) pp. 329-55 (1994)). A greedy algorithm seeks an optimal solution to a problem by making a sequence of choices. At each decision point in the algorithm, the choice that seems best at the moment is made. This heuristic strategy does not always produce an optimal solution. Greedy algorithms differ from dynamic programming in that in dynamic programming, a choice is made at each step, but the choice may depend on the solutions to subproblems. In a greedy algorithm, whatever choice seems best at the moment is made, and then the subproblems arising after the choice are solved. Thus, the choice made by a greedy algorithm may depend on the choices made thus far, but cannot depend on any future choices or on the solutions to subproblems. One application of the greedy approach is the construction of Huffman codes. A Huffman greedy algorithm constructs an optimal prefix code by building a tree T corresponding to the optimal code in a bottom-up manner. It begins with a set of |C| leaves and performs a sequence of |C|−1 “merging” operations to create the final tree. For example, assume C is a set of n characters and that each character c ∈ C is an object with a defined frequency f[c]; a priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of the merger of two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged. For example:

-   1. n←|C|
-   2. Q←C
-   3. for i←1 to n−1
-   4. do z←ALLOCATE-NODE( )
-   5. x←left[z]←EXTRACT-MIN(Q)
-   6. y←right[z]←EXTRACT-MIN(Q)
-   7. f[z]←f[x]+f[y]
-   8. INSERT (Q,z)
-   9. return EXTRACT-MIN(Q)

Line 2 initializes the priority queue Q with the characters in C. The for loop in lines 3-8 repeatedly extracts the two nodes x and y of lowest frequency from the queue, and replaces them in the queue with a new node z representing their merger. The frequency of z is computed as the sum of the frequencies of x and y in line 7. The node z has x as its left child and y as its right child. After n−1 mergers, the one node left in the queue (the root of the code tree) is returned in line 9.
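The numbered Huffman procedure can be sketched in Python as shown below. The names `huffman_tree` and `code_lengths`, the tuple representation of tree nodes, and the example frequencies are assumptions made for illustration; symbols are assumed to be non-tuple values such as strings, and the input is assumed to be non-empty.

```python
import heapq
from itertools import count


def huffman_tree(freq):
    """Build a Huffman tree from a non-empty {symbol: frequency} mapping.

    Leaves are symbols; internal nodes are (left, right) tuples.
    """
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]  # Q <- C
    heapq.heapify(heap)
    for _step in range(len(freq) - 1):          # n - 1 merges
        fx, _, x = heapq.heappop(heap)          # least frequent node
        fy, _, y = heapq.heappop(heap)          # next least frequent node
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
    return heapq.heappop(heap)[2]               # root of the code tree


def code_lengths(tree, depth=0):
    """Map each symbol to the length of its code word (its depth in the tree)."""
    if not isinstance(tree, tuple):             # leaf
        return {tree: depth}
    left, right = tree
    lengths = code_lengths(left, depth + 1)
    lengths.update(code_lengths(right, depth + 1))
    return lengths


# Illustrative frequencies (assumed values, not data from this document).
example_freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
example_lengths = code_lengths(huffman_tree(example_freq))
```

With these frequencies the most frequent symbol receives a one-bit code and the two rarest symbols receive four-bit codes, which is the behavior the merging loop is meant to produce.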

Again, these methods result in a set of final, non-overlapping SNP haplotype blocks that encompasses all SNPs evaluated in a particular genomic region. An important result of selecting SNPs, SNP haplotype blocks and SNP haplotype patterns according to the methods of the present invention is that, in some embodiments, informative SNPs for each SNP haplotype block and pattern are determined during the calculation of informativeness of SNP haplotype blocks. Informative SNPs allow for data compression.

In one embodiment of the present invention, the selection of at least log₂ p SNPs from each group containing p patterns (rounding up to the nearest integer) provides one set of informative SNPs which are unusually powerful for predicting genotype/phenotype associations. One skilled in the art recognizes that in other analyses it is not necessary to use spatially contiguous groups to determine such a subset. For example, in some embodiments of the present invention, it may be desirable to identify sets of non-adjacent SNPs that statistically are passed on in a fashion analogous to that of SNP haplotype blocks even though they are not spatially contiguous on the DNA strand.
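The log₂ p bound can be illustrated with a short calculation. The sketch below assumes biallelic SNPs, so that k SNPs can distinguish at most 2^k patterns; the function name `min_informative_snps` is hypothetical. Integer arithmetic is used to avoid floating-point rounding.

```python
def min_informative_snps(num_patterns):
    """Smallest k with 2**k >= num_patterns, i.e. log2(p) rounded up."""
    if num_patterns < 1:
        raise ValueError("a group must contain at least one pattern")
    # (p - 1).bit_length() equals ceil(log2(p)) for every integer p >= 1.
    return (num_patterns - 1).bit_length()


# Number of SNPs needed for groups of 1 to 9 patterns.
example_counts = {p: min_informative_snps(p) for p in range(1, 10)}
```

For instance, a group with four patterns needs at least two such SNPs, while a group with five patterns needs at least three.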

In order to determine accurately the SNP haplotype blocks that will be used in association studies (that is, to build an accurate baseline of SNPs and SNP haplotype blocks and patterns), it is necessary to examine more than a few individual DNA strands. FIG. 6 illustrates the importance of examining at least about five different DNA strands for determining SNP haplotype blocks and for the selection of informative SNPs. The top portion of FIG. 6 illustrates the sequence of a hypothetical stretch of DNA, with the variant positions indicated and variant block boundaries drawn; however, SNP haplotype block boundaries would not be known ab initio. Sequencing results 610 show the results of sequencing haploid DNA of three individuals. As shown, in general it is possible to have identified a large fraction of the common SNPs after a relatively small number of individuals have been sequenced. In the case in FIG. 6, the SNPs at each location shown in the top portion of FIG. 6 have been identified, as indicated by check marks.

If, however, further individuals are not evaluated, the block boundaries would not be correctly identified at this stage. For example, while one could at this stage draw block boundaries between blocks 620 and 630 (note that the first C→G variant predicts the first G→A variant, and the first C→T variant predicts the second C→T variant), it is not possible to distinguish between the blocks 630 and 640 at this stage. At this stage it appears that the first C→T variant would predict the first and second T→A variants. Accordingly, a more statistically significant sample set is required to draw the block boundaries. For example, in the methods of the present invention, the number of DNA strands analyzed to determine SNP haplotype blocks, SNP haplotype patterns, and/or informative SNPs is a plurality, for example, at least about five or at least about 10. In preferred embodiments, the number of DNA strands is at least 16. In more preferred embodiments, the number of DNA strands analyzed to determine SNP haplotype blocks, SNP haplotype patterns, and/or informative SNPs is at least 25. However, once relevant SNPs have been identified (i.e., SNP discovery has been performed), it is possible to genotype only the variant positions in the remaining samples to complete the process of identifying block boundaries without sequencing the entire stretch of genomic DNA. For examples of such methods, see U.S. Ser. No. 10/042,819, filed Jan. 6, 2002, attorney docket number 1016N-1, entitled “Whole Genome Scanning”.

The results of performing a genotyping process on only the SNPs in another hypothetical genomic sample are shown in FIG. 6 at 650. As shown, by performing this additional genotyping step, it is now possible to see that blocks 630 and 640 are distinguishable. Specifically, it is now possible to see that the first C→T variant does not track with the first and second T→A variants, but instead, the first C→T variant can be used to predict only the second C→T variant (and vice versa) and the first T→A variant can be used only to predict the second T→A variant (and vice versa).

In addition to the aspects of the present invention described above, a specific embodiment of the present invention is that it can be employed to resolve ambiguous SNP haplotype sequences for data analysis. For example, a SNP may be ambiguous because data from a gel sequencing operation or array hybridization experiment does not give a clear result. “Resolving” in this case may mean, e.g., resolving ambiguous SNP locations in a SNP haplotype sequence by matching the SNP haplotype sequence to the SNP haplotype pattern to which the SNP haplotype sequence most closely relates. Additionally, “resolving” may mean removing an ambiguous SNP haplotype sequence from data analysis.

In one embodiment of resolving ambiguous SNP haplotype sequences, SNP haplotype sequences are placed in a data set for possible addition to a pattern set. The data set will contain all SNP haplotype sequences that are to be evaluated for possible assignment to a SNP haplotype pattern. Referring now to FIG. 7A, in step 710, the SNP haplotype sequences in the data set are compared, one by one, to the pattern sequences in the pattern set. In some cases, there will be no patterns in the pattern set initially, though in other cases some or all pattern sequences may be known beforehand. In step 720, a query is made: is the SNP haplotype sequence from the data set consistent with a pattern sequence in the pattern set? If the answer is no, step 730 provides that the SNP haplotype sequence being evaluated will be added to the pattern set. If the answer is yes, another query is made (740): is the SNP haplotype sequence from the data set consistent with more than one pattern sequence in the pattern set?

If the answer is yes, the SNP sequence from the data set may be discarded or, in some embodiments, held for further or different analyses (step 750). If the answer to the second query is no, then, in step 760, the SNP sequence from the data set is compared to the pattern sequence from the pattern set with which it is consistent. From these two sequences, the SNP sequence with the fewest ambiguities is selected and placed in the pattern set (770). The SNP sequence containing more ambiguities may be discarded, or, in some embodiments, held for further or different types of analyses.

The resolving process may be understood further by referring to FIGS. 7A and 7B. In FIG. 7B, a first SNP sequence, TTCGA, is compared to the sequences contained in the pattern set (step 710). At this point, there are no pattern sequences contained in the pattern set, thus TTCGA is not consistent with any pattern sequence in the pattern set. This occurrence of SNP sequence TTCGA is then removed from the data set (or is retained for different analyses), and added to the pattern set (730). The pattern set now has one pattern sequence, TTCGA.

Looking again at FIG. 7B, the second SNP sequence in the data set, T?C??, is compared to the sequence contained in the pattern set (step 710). Now there is one pattern sequence in the pattern set, TTCGA, and T?C?? is consistent with that sequence (step 720). The answer to the second query (740), whether SNP sequence T?C?? is consistent with more than one pattern sequence in the pattern set, is no, as currently there is only one pattern sequence, TTCGA, in the pattern set. In step 760, T?C?? is compared to TTCGA to determine which sequence has more ambiguities. T?C?? clearly does; thus, TTCGA is retained in the pattern set (770) and T?C?? may be discarded or held for further analyses.

The third sequence of the data set in FIG. 7B is C????. C???? first is compared to TTCGA (step 710), is found not to be consistent with TTCGA (720), and is thus added to the pattern set (730). The fourth sequence in FIG. 7B is CTACA. CTACA is compared to TTCGA and C???? (the pattern sequences in the pattern set, step 710), and is found to be consistent with C???? (720). The second query (740) now is made: is CTACA consistent with both C???? and TTCGA? The answer is no, so C???? and CTACA are then compared (760) and the sequence with the fewest ambiguities, in this case CTACA, is held in the pattern set, and C???? is discarded (removed from analysis) or held for further analyses (770).

The fifth SNP sequence in the data set in FIG. 7B is ?T??A. This SNP sequence is compared to pattern sequences TTCGA and CTACA (710) and is found to be consistent with both TTCGA and CTACA. Thus, the answer to query 740 is yes: ?T??A is consistent with more than one pattern sequence in the pattern set. In step 750, SNP sequence ?T??A is held for further analysis or discarded (removed from analysis).
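The flow of FIG. 7A, applied to the five sequences of FIG. 7B, can be sketched in Python as follows. This is an illustrative reading of the steps rather than a definitive implementation: the function names, the use of “?” for an ambiguous position, the single `set_aside` list for sequences that are discarded or held, and the rule that an existing pattern is kept when both sequences have equally many ambiguities are all assumptions.

```python
def consistent(a, b):
    """True if two sequences agree wherever neither is ambiguous ('?')."""
    return len(a) == len(b) and all(
        x == y or "?" in (x, y) for x, y in zip(a, b)
    )


def resolve(data_set):
    """Return (pattern_set, set_aside) following steps 710-770."""
    patterns, set_aside = [], []
    for seq in data_set:                                   # step 710
        matches = [i for i, p in enumerate(patterns) if consistent(seq, p)]
        if not matches:                                    # step 720: no
            patterns.append(seq)                           # step 730
        elif len(matches) > 1:                             # step 740: yes
            set_aside.append(seq)                          # step 750
        else:                                              # steps 760-770
            i = matches[0]
            if seq.count("?") < patterns[i].count("?"):
                set_aside.append(patterns[i])
                patterns[i] = seq      # keep the less ambiguous sequence
            else:
                set_aside.append(seq)
    return patterns, set_aside


fig_7b = ["TTCGA", "T?C??", "C????", "CTACA", "?T??A"]
pattern_set, set_aside = resolve(fig_7b)
```

Run on the FIG. 7B sequences, the sketch ends with TTCGA and CTACA in the pattern set and the other three sequences set aside, matching the walk-through above.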

Another approach to resolving allows that if, for example, one pattern sequence is CCATT? and a SNP sequence from the data set is C?ATTG, the sequences are “combined” to solve the ambiguities (CCATTG), and the “combined” sequence is added to the pattern set. Additional array hybridizations, sequencing or other techniques known in the art may be employed to analyze ambiguous SNP nucleotide positions.
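The “combining” approach can be sketched as a position-by-position merge. The function name `combine` and the choice to raise an error when the two sequences conflict at an unambiguous position are assumptions made for this example.

```python
def combine(a, b):
    """Merge two consistent sequences, filling each '?' from the other."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    merged = []
    for x, y in zip(a, b):
        if x == "?":
            merged.append(y)           # take whatever the other one has
        elif y == "?" or x == y:
            merged.append(x)
        else:
            raise ValueError("sequences conflict and cannot be combined")
    return "".join(merged)


combined = combine("CCATT?", "C?ATTG")
```

A position that is ambiguous in both inputs remains “?” in the result and would still need further analysis.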

Association of Phenotypes with SNP Haplotype Blocks and Patterns

The SNP haplotype blocks, SNP haplotype patterns and/or informative SNPs identified may be used for a variety of genetic analyses. For example, once informative SNPs have been identified, they may be used in a number of different assays for association studies. For example, probes may be designed for microarrays that interrogate these informative SNPs. Other exemplary assays include, e.g., the Taqman assays and Invader assays described supra, as well as conventional PCR and/or sequencing techniques.

In some embodiments, as shown in step 170 of FIG. 1, the haplotype patterns identified may be used in the above-referenced assays to perform association studies. This may be accomplished by determining haplotype patterns in individuals with the phenotype of interest (for example, individuals exhibiting a particular disease or individuals who respond in a particular manner to administration of a drug) and comparing the frequency of the haplotype patterns in these individuals to the haplotype pattern frequency in a control group of individuals. Preferably, such SNP haplotype pattern determinations are genome-wide; however, it may be that only specific regions of the genome are of interest, and the SNP haplotype patterns of those specific regions are used. In addition to the other embodiments of the methods of the present invention disclosed herein, the methods additionally allow for the “dissection” of a phenotype. That is, a particular phenotype may result from two or more different genetic bases. For example, obesity in one individual may be the result of a defect in Gene X, while the obesity phenotype in a different individual may be the result of mutations in Gene Y and Gene Z. Thus, the genome scanning capabilities of the present invention allow for the dissection of varying genetic bases for similar phenotypes. Once specific regions of the genome are identified as being associated with a particular phenotype, these regions may be used as drug discovery targets (step 180 of FIG. 1) or as diagnostic markers (step 190 of FIG. 1).

As described in the previous paragraph, one method of conducting association studies is to compare the frequency of SNP haplotype patterns in individuals with a phenotype of interest to the SNP haplotype pattern frequency in a control group of individuals. In a preferred method, informative SNPs are used to make the SNP haplotype pattern comparison. The approach of using informative SNPs has a tremendous advantage over other whole genome scanning or genotyping methods known in the art to date: instead of reading all 3 billion bases of each individual's genome, or even reading the 3-4 million common SNPs that may be found, only informative SNPs from a sample population need to be determined. Reading these particular, informative SNPs provides sufficient information to allow statistically accurate association data to be extracted from specific experimental populations, as described above.

FIG. 8 illustrates an embodiment of one method of determining genetic associations using the methods of the present invention. In step 800, the frequency of informative SNPs is determined for genomes of a control population. In step 810, the frequency of informative SNPs is determined for genomes of a clinical population. Steps 800 and 810 may be performed by using the aforementioned SNP assays to analyze the informative SNPs in a population of individuals. In step 820, the informative SNP frequencies from steps 800 and 810 are compared. Frequency comparisons may be made, for example, by determining the minor allele frequency (number of individuals with a particular minor allele divided by the total number of individuals) at each informative SNP location in each population and comparing these minor allele frequencies. In step 830, the informative SNPs displaying a difference between the frequency of occurrence in the control versus clinical populations are selected for analysis. Once informative SNPs are selected, the SNP haplotype blocks that contain the informative SNPs are identified, which in turn identifies the genomic region of interest (step 840). The genomic regions are analyzed by genetic or biological methods known in the art (step 850), and the regions are analyzed for possible use as drug discovery targets (step 860) or as diagnostic markers (step 870), as described in detail below.
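Steps 800 through 830 can be sketched in Python as follows, using the minor allele frequency definition given above (individuals carrying the minor allele divided by all individuals). The function names, the genotype strings, the SNP identifiers, and the fixed `min_difference` threshold are hypothetical; the threshold is only a placeholder for whatever statistical criterion is actually chosen.

```python
def minor_allele_frequency(genotypes, minor_allele):
    """Fraction of individuals whose genotype contains the minor allele."""
    carriers = sum(1 for g in genotypes if minor_allele in g)
    return carriers / len(genotypes)


def select_snps(control, clinical, minor_alleles, min_difference=0.1):
    """Return {snp: (control_freq, clinical_freq)} for SNPs that differ."""
    selected = {}
    for snp, allele in minor_alleles.items():
        f_control = minor_allele_frequency(control[snp], allele)    # step 800
        f_clinical = minor_allele_frequency(clinical[snp], allele)  # step 810
        if abs(f_clinical - f_control) >= min_difference:           # 820-830
            selected[snp] = (f_control, f_clinical)
    return selected


# Made-up genotypes for two hypothetical informative SNPs, four people each.
control = {"snp1": ["AA", "AG", "AA", "AA"], "snp2": ["CC", "CT", "CC", "CC"]}
clinical = {"snp1": ["AG", "GG", "AG", "AA"], "snp2": ["CC", "CC", "CT", "CC"]}
minor_alleles = {"snp1": "G", "snp2": "T"}
selected = select_snps(control, clinical, minor_alleles)
```

In this made-up data only snp1 differs between the two groups, so only the haplotype block containing snp1 would be carried forward to step 840.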

Uses of Identified Genomic Sequences

Once a genetic locus or multiple loci in the genome are associated with a particular phenotypic trait (for example, a disease susceptibility locus), the gene or genes or regulatory elements responsible for the trait can be identified. These genes or regulatory elements may then be used as therapeutic targets for the treatment of the disease, as shown in step 180 of FIG. 1 or step 860 of FIG. 8. The genomic sequences identified by the methods of the present invention may be genic or nongenic sequences. The term “gene” is intended to mean the open reading frame (ORF) encoding specific polypeptides, intronic regions, as well as adjacent 5′ and 3′ non-coding nucleotide sequences involved in the regulation of expression of the gene up to about 10 kb beyond the coding region, but possibly further in either direction. The ORFs of an identified gene may affect the disease state due to their effect on protein structure. Alternatively, the noncoding sequences of the identified gene or nongenic sequences may affect the disease state by impacting the level of expression or specificity of expression of a protein. Generally, genomic sequences are studied by isolating the identified gene substantially free of other nucleic acid sequences that do not include the genic sequence. The DNA sequences are used in a variety of ways. For example, the DNA may be used to detect or quantify expression of the gene in a biological specimen. The manner in which cells are probed for the presence of particular nucleotide sequences is well established in the literature and does not require elaboration here; however, see, e.g., Sambrook, et al., Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory, New York) (1989).

In addition, the sequence of the gene, including flanking promoter regions and coding regions, may be mutated in various ways known in the art to generate targeted changes in expression level, or changes in the sequence of the encoded protein, etc. The sequence changes may be substitutions, insertions, translocations or deletions. Deletions may include large changes, such as deletions of an entire domain or exon. Techniques for in vitro mutagenesis of cloned genes are known. Examples of protocols for site specific mutagenesis may be found in Gustin, et al., Biotechniques 14:22 (1993); Barany, Gene 37:111-23 (1985); Colicelli, et al., Mol. Gen. Genet. 199:537-9 (1985); Prentki, et al., Gene 29:303-13 (1984); Sambrook, et al., Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Press) pp. 15.3-15.108 (1989); Weiner, et al., Gene 126:35-41 (1993); Sayers, et al., Biotechniques 13:592-6 (1992); Jones and Winistorfer, Biotechniques 12:528-30 (1992); and Barton, et al., Nucleic Acids Res. 18:7349-55 (1990). Such mutated genes may be used to study structure/function relationships of the protein product, or to alter the properties of the protein that affect its function or regulation.

The identified gene may be employed for producing all or portions of the resulting polypeptide. To express a protein product, an expression cassette incorporating the identified gene may be employed. The expression cassette or vector generally provides a transcriptional and translational initiation region, which may be inducible or constitutive, where the coding region is operably linked under the transcriptional control of the transcriptional initiation region, and a transcriptional and translational termination region. These control regions may be native to the identified gene, or may be derived from exogenous sources.

The peptide may be expressed in prokaryotes or eukaryotes in accordance with conventional methods, depending upon the purpose for expression. For large scale production of the protein, a unicellular organism, such as E. coli, B. subtilis, S. cerevisiae, insect cells in combination with baculovirus vectors, or cells of a higher organism such as vertebrates, particularly mammals, e.g. COS 7 cells, may be used as the expression host cells. In many situations, it may be desirable to express the gene in eukaryotic cells, where the gene will benefit from native folding and post-translational modifications. Small peptides also can be synthesized in the laboratory. With the availability of the protein or fragments thereof in large amounts, the protein may be isolated and purified in accordance with conventional ways. A lysate may be prepared of the expression host and the proteins or fragments thereof purified using HPLC, exclusion chromatography, gel electrophoresis, affinity chromatography, or other purification techniques.

An expressed protein may be used for the production of antibodies, where short fragments induce the expression of antibodies specific for the particular polypeptide (monoclonal antibodies), and larger fragments or the entire protein allow for the production of antibodies over the length of the polypeptide (polyclonal antibodies). Antibodies are prepared in accordance with conventional ways, where the expressed polypeptide or protein is used as an immunogen, by itself or conjugated to known immunogenic carriers, e.g. KLH, pre-S HBsAg, other viral or eukaryotic proteins, or the like. Various adjuvants may be employed, with a series of injections, as appropriate. For monoclonal antibodies, after one or more booster injections, the spleen is isolated, the lymphocytes are immortalized by cell fusion and screened for high affinity antibody binding. The immortalized cells, i.e., hybridomas, producing the desired antibodies may then be expanded. For further description, see Monoclonal Antibodies: A Laboratory Manual, Harlow and Lane, eds. (Cold Spring Harbor Laboratories, Cold Spring Harbor, N.Y.) (1988). If desired, the mRNA encoding the heavy and light chains may be isolated and mutagenized by cloning in E. coli, and the heavy and light chains mixed to further enhance the affinity of the antibody. Alternatives to in vivo immunization as a method of raising antibodies include binding to phage “display” libraries, usually in conjunction with in vitro affinity maturation.

The identified genes, gene fragments, or the encoded protein or protein fragments may be useful in gene therapy to treat degenerative and other disorders. For example, expression vectors may be used to introduce the identified gene into a cell. Such vectors generally have convenient restriction sites located near the promoter sequence to provide for the insertion of nucleic acid sequences in a recipient genome. Transcription cassettes may be prepared comprising a transcription initiation region, the target gene or fragment thereof, and a transcriptional termination region. The transcription cassettes may be introduced into a variety of vectors, e.g. plasmid; retrovirus, e.g. lentivirus; adenovirus; and the like, where the vectors are able to be transiently or stably maintained in the cells. The gene or protein product may be introduced directly into tissues or host cells by any number of routes, including viral infection, microinjection, or fusion of vesicles. Jet injection may also be used for intramuscular administration, as described by Furth, et al., Anal. Biochem. 205:365-68 (1992). Alternatively, the DNA may be coated onto gold microparticles, and delivered intradermally by a particle bombardment device, or “gene gun” as described in the literature (see, for example, Tang, et al., Nature, 356:152-54 (1992)).

Antisense molecules can be used to down-regulate expression of the identified gene in cells. The antisense reagent may be antisense oligonucleotides, particularly synthetic antisense oligonucleotides having chemical modifications, or nucleic acid constructs that express such antisense molecules as RNA. A combination of antisense molecules may be administered, where a combination may comprise multiple different sequences.

As an alternative to antisense inhibitors, catalytic nucleic acid compounds, e.g., ribozymes, anti-sense conjugates, etc., may be used to inhibit gene expression. Ribozymes may be synthesized in vitro and administered to the patient, or may be encoded on an expression vector, from which the ribozyme is synthesized in the targeted cell (for example, see International patent application WO 9523225, and Beigelman, et al., Nucl. Acids Res. 23:4434-42 (1995)). Examples of oligonucleotides with catalytic activity are described in WO 9506764. Conjugates of antisense oligonucleotides with a metal complex, e.g. terpyridylCu(II), capable of mediating mRNA hydrolysis are described in Bashkin, et al., Appl. Biochem. Biotechnol. 54:43-56 (1995).

In addition to using the identified sequences for gene therapy, the identified nucleic acids can be used to generate genetically modified non-human animals to create animal models of diseases or to generate site-specific gene modifications in cell lines for the study of protein function or regulation. The term “transgenic” is intended to encompass genetically modified animals having an exogenous gene that is stably transmitted in the host cells where, for example, the gene may be altered in sequence to produce a modified protein, or may be a reporter gene operably linked to an exogenous promoter. Transgenic animals may be made through homologous recombination, where the endogenous gene locus is altered, replaced or otherwise disrupted. Alternatively, a nucleic acid construct may be randomly integrated into the genome. Vectors for stable integration include plasmids, retroviruses and other animal viruses, YACs, and the like. Of interest are transgenic mammals, e.g., cows, pigs, goats, horses, etc., and, particularly, rodents, e.g., rats, mice, etc.

Investigation of genetic function may also utilize non-mammalian models, particularly using those organisms that are biologically and genetically well-characterized, such as C. elegans, D. melanogaster and S. cerevisiae. The subject gene sequences may be used to knock-out corresponding gene function or to complement defined genetic lesions in order to determine the physiological and biochemical pathways involved in protein function. Drug screening may be performed in combination with complementation or knock-out studies, e.g., to study progression of degenerative disease, to test therapies, or for drug discovery.

In addition, the modified cells or animals are useful in the study of protein function and regulation. For example, a series of small deletions and/or substitutions may be made in the identified gene to determine the role of different domains in enzymatic activity, cell transport or localization, etc. Specific constructs of interest include, but are not limited to, antisense constructs to block gene expression, expression of dominant negative genetic mutations, and over-expression of the identified gene. One may also provide for expression of the identified gene or variants thereof in cells or tissues where it is not normally expressed or at abnormal times of development. In addition, by providing expression of a protein in cells in which it is not normally produced, one can induce changes in cellular behavior that provide information regarding the normal function of the protein.

Protein molecules may be assayed to investigate structure/function parameters. For example, by providing for the production of large amounts of a protein product of an identified gene, one can identify ligands or substrates that bind to, modulate or mimic the action of that protein product. Drug screening identifies agents that provide, e.g., a replacement or enhancement for protein function in affected cells, or agents that modulate or negate protein function. The term “agent” as used herein describes any molecule, e.g. protein or small molecule, with the capability of altering, mimicking or masking, either directly or indirectly, the physiological function of an identified gene or gene product. Generally a plurality of assay mixtures are run in parallel with different concentrations of the agent to obtain a differential response to the various concentrations. Typically, one of these concentrations serves as a negative control, i.e., at zero concentration or below the level of detection.

A wide variety of assays may be used for this purpose, including labeled in vitro protein-protein binding assays, protein-DNA binding assays, electrophoretic mobility shift assays, immunoassays for protein binding, and the like. Also, all or a fragment of the purified protein may be used for determination of three-dimensional crystal structure, which can be used for determining the biological function of the protein or a part thereof, modeling intermolecular interactions, membrane fusion, etc.

Candidate agents encompass numerous chemical classes, though typically they are organic molecules or complexes, preferably small organic compounds, having a molecular weight of more than 50 and less than about 2,500 daltons. Candidate agents comprise functional groups necessary for structural interaction with proteins, particularly hydrogen bonding, and typically include at least an amine, carbonyl, hydroxyl or carboxyl group, and frequently at least two of the functional chemical groups. The candidate agents often comprise cyclical carbon or heterocyclic structures and/or aromatic or polyaromatic structures substituted with one or more of the above functional groups. Candidate agents are also found among biomolecules including, but not limited to, peptides, saccharides, fatty acids, steroids, purines, pyrimidines, derivatives, structural analogs or combinations thereof.

Candidate agents are obtained from a wide variety of sources including libraries of synthetic or natural compounds. For example, numerous means are available for random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides and oligopeptides. Alternatively, libraries of natural compounds in the form of bacterial, fungal, plant and animal extracts are available or readily produced. Additionally, natural or synthetically produced libraries and compounds are readily modified through conventional chemical, physical and biochemical means, and may be used to produce combinatorial libraries. Known pharmacological agents may be subjected to directed or random chemical modifications, such as acylation, alkylation, esterification, amidification, etc., to produce structural analogs.

Where the screening assay is a binding assay, one or more of the molecules may be coupled to a label, where the label can directly or indirectly provide a detectable signal. Various labels include radioisotopes, fluorescers, chemiluminescers, enzymes, specific binding molecules, particles, e.g., magnetic particles, and the like. Specific binding molecules include pairs, such as biotin and streptavidin, digoxin and antidigoxin, etc. For the specific binding members, the complementary member would normally be labeled with a molecule that provides for detection, in accordance with known procedures.

A variety of other reagents may be included in the screening assay. These include reagents like salts, neutral proteins, e.g., albumin, detergents, etc., that are used to facilitate optimal protein-protein binding and/or reduce non-specific or background interactions. Reagents that improve the efficiency of the assay, such as protease inhibitors, nuclease inhibitors, anti-microbial agents, etc., may be used.

Agents may be combined with a pharmaceutically acceptable carrier, including any and all solvents, dispersion media, coatings, anti-oxidant, isotonic and absorption delaying agents and the like. The use of such media and agents for pharmaceutically active substances is well known in the art. Except insofar as any conventional media or agent is incompatible with the active ingredient, its use in the therapeutic compositions and methods described herein is contemplated. Supplementary active ingredients can also be incorporated into the compositions.

The formulation may be prepared for use in various methods for administration. The formulation may be given orally, by inhalation, or may be injected, e.g., intravascular, intratumor, subcutaneous, intraperitoneal, intramuscular, etc. The dosage of the therapeutic formulation will vary widely, depending upon the nature of the disease, the frequency of administration, the manner of administration, the clearance of the agent from the host, and the like. The initial dose may be larger, followed by smaller maintenance doses. The dose may be administered as infrequently as once, weekly or biweekly, or fractionated into smaller doses and administered daily, semi-weekly, etc., to maintain an effective dosage level. In some cases, oral administration will require a different dose than if administered intravenously. Identified agents of the invention can be incorporated into a variety of formulations for therapeutic administration. More particularly, the complexes can be formulated into pharmaceutical compositions by combination with appropriate, pharmaceutically acceptable carriers or diluents, and may be formulated into preparations in solid, semi-solid, liquid or gaseous forms, such as tablets, capsules, powders, granules, ointments, solutions, suppositories, injections, inhalants, gels, microspheres, and aerosols. As such, administration of the agents can be achieved in various ways. Agents may be systemic after administration or may be localized by the use of an implant that acts to retain the active dose at the site of implantation.

The following methods and excipients are merely exemplary and are in no way limiting. For oral preparations, an agent can be used alone or in combination with appropriate additives to make tablets, powders, granules or capsules, for example, with conventional additives, such as lactose, mannitol, corn starch or potato starch; with binders, such as crystalline cellulose, cellulose derivatives, acacia, corn starch or gelatins; with disintegrators, such as corn starch, potato starch or sodium carboxymethylcellulose; with lubricants, such as talc or magnesium stearate; and if desired, with diluents, buffering agents, moistening agents, preservatives and flavoring agents.

Additionally, agents may be formulated into preparations for injections by dissolving, suspending or emulsifying them in an aqueous or nonaqueous solvent, such as vegetable or other similar oils, synthetic aliphatic acid glycerides, esters of higher aliphatic acids or propylene glycol; and if desired, with conventional additives such as solubilizers, isotonic agents, suspending agents, emulsifying agents, stabilizers and preservatives. Further, agents may be utilized in aerosol formulation to be administered via inhalation. The agents identified by the present invention can be formulated into pressurized acceptable propellants such as dichlorodifluoromethane, propane, nitrogen and the like. Alternatively, agents may be made into suppositories by mixing with a variety of bases such as emulsifying bases or water-soluble bases. Further, identified agents of the present invention can be administered rectally via a suppository. The suppository can include vehicles such as cocoa butter, carbowaxes and polyethylene glycols, which melt at body temperature, yet are solidified at room temperature.

Implants for sustained release formulations are well-known in the art. Implants are formulated as microspheres, slabs, etc. with biodegradable or non-biodegradable polymers. For example, polymers of lactic acid and/or glycolic acid form an erodible polymer that is well-tolerated by the host. The implant containing identified agents of the present invention may be placed in proximity to the site of action, so that the local concentration of active agent is increased relative to the rest of the body. Unit dosage forms for oral or rectal administration such as syrups, elixirs, and suspensions may be provided wherein each dosage unit, for example, teaspoonful, tablespoonful, gel capsule, tablet or suppository, contains a predetermined amount of the compositions of the present invention. Similarly, unit dosage forms for injection or intravenous administration may comprise the compound of the present invention in a composition as a solution in sterile water, normal saline or another pharmaceutically acceptable carrier. The specifications for the novel unit dosage forms of the present invention depend on the particular compound employed and the effect to be achieved, and the pharmacodynamics associated with each active agent in the host.

The pharmaceutically acceptable excipients, such as vehicles, adjuvants, carriers or diluents, are readily available to the public. Moreover, pharmaceutically acceptable auxiliary substances, such as pH adjusting and buffering agents, tonicity adjusting agents, stabilizers, wetting agents and the like, are readily available to the public.

A therapeutic dose of an identified agent is administered to a host suffering from a disease or disorder. Administration may be topical, localized or systemic, depending on the specific disease. The compounds are administered at an effective dosage such that over a suitable period of time the disease progression may be substantially arrested. It is contemplated that the composition will be obtained and used under the guidance of a physician for in vivo use. The dose will vary depending on the specific agent and formulation utilized, type of disorder, patient status, etc., such that it is sufficient to address the disease or symptoms thereof, while minimizing side effects. Treatment may be for short periods of time, e.g., after trauma, or for extended periods of time, e.g., in the prevention or treatment of schizophrenia.

The SNPs identified by the present invention may be used to analyze the expression pattern of an associated gene and the expression pattern correlated to a phenotypic trait of the organism such as disease susceptibility or drug responsiveness. The expression pattern in various tissues can be determined and used to identify ubiquitous expression patterns, tissue specific expression patterns, temporal expression patterns and expression patterns induced by various external stimuli such as chemicals or electromagnetic radiation. Such determinations would provide information regarding function of the gene and/or its protein product.

The newly identified sequences also may be used as diagnostic markers, i.e., to predict a phenotypic characteristic such as disease susceptibility or drug responsiveness. In addition, the methods of the present invention may be used to stratify populations for clinical studies. As such, the genes or fragments thereof may be used as probes to determine whether the same nucleic acid sequence is present in the genome of an organism being tested. In addition, the probes may be used to monitor RNA or mRNA levels within the organism to be tested or a part thereof, such as a specific tissue or organ, so as to determine the expression level of the marker where the expression level can be correlated to a particular phenotypic characteristic of the organism. Likewise, the marker may be assayed at the protein level using any customary technique such as immunological methods (Western blots, radioimmune precipitation and the like) or activity based assays measuring an activity associated with the gene product. Moreover, when a phenotype cannot clearly distinguish between similar diseases having different genetic bases, the methods of the present invention can be used to identify the disease correctly.

Also, it should be apparent that the methods of the present invention can be used on organisms aside from humans. For example, when the organism is an animal, the methods of the invention may be used to identify loci associated, e.g., with disease resistance or susceptibility, environmental tolerance, drug response or the like, and when the organism is a plant, the methods of the invention may be used to identify loci associated with disease resistance or susceptibility, environmental tolerance and/or herbicide resistance.

It is to be understood that this invention is not limited to the particular methodology, protocols, cell lines, animal species or genera, and reagents described, as such may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention, which will be limited only by the appended claims.

Databases

The present invention includes databases containing information concerning variations, for instance, information concerning SNPs, SNP haplotype blocks, SNP haplotype patterns and informative SNPs. In some embodiments, the databases of the present invention may comprise information on one or more haplotype patterns associated with one or more phenotypic traits. Databases may also contain information associated with a given variation such as descriptive information about the general genomic region in which the variation occurs, such as whether the variation is located in a known gene, whether there are known genes, gene homologs or regulatory regions nearby and the like.

Other information that may be included in the databases of the present invention includes, but is not limited to, SNP sequence information, descriptive information concerning the clinical status of a tissue sample analyzed for SNP haplotype patterns, or the clinical status of the patient from which the sample was derived. The database may be designed to include different parts, for instance a variation database, a SNP database, a SNP haplotype block or SNP haplotype pattern database and an informative SNP database. Methods for the configuration and construction of databases are widely available, for instance, see Akerblom et al., (1999) U.S. Pat. No. 5,953,727, which is herein incorporated by reference in its entirety.

The databases of the invention may be linked to an outside or external database. FIG. 9 shows an exemplary computer network that is suitable for the databases and for executing the software of the present invention. A computer workstation 902 is connected with the application/data server(s) 906 through a local area network (LAN), such as an Ethernet 905. A printer 904 may be connected directly to the workstation or to the Ethernet 905. The LAN may be connected to a wide area network (WAN), such as the internet 908, via a gateway server 907 which may also serve as a firewall between the WAN 908 and the LAN 905. In preferred embodiments, the workstation may communicate with outside data sources, such as The SNP Consortium (TSC) or the National Center for Biotechnology Information 909, through the internet 908.

Any appropriate computer platform may be used to perform the necessary comparisons between SNP haplotype blocks or patterns, associated phenotypes, any other information in the database or information provided as an input. For example, a large number of computer workstations are available from a variety of manufacturers, such as those available from Silicon Graphics. Client-server environments, database servers and networks are also widely available and are appropriate platforms for the databases of the invention.

The databases of the invention may also be used to present information identifying the SNP haplotype pattern in an individual and such a presentation may be used to predict one or more phenotypic traits of the individual. Such methods may be used to predict the disease susceptibility/resistance and/or drug response of the individual. Further, the databases of the present invention may comprise information relating to the expression level of one or more of the genes associated with the variations of the invention.

The following examples describe specific embodiments of the present invention and the materials and methods are illustrative of the invention and are not intended to limit the scope of the invention.

Example 1 Preparation of Somatic Cell Hybrids

Standard procedures in somatic cell genetics were used to separate human DNA strands (chromosomes) from a diploid state to a haploid state. In this case, a diploid human lymphoblastoid cell line that was wildtype for the thymidine kinase gene was fused to a diploid hamster fibroblast cell line containing a mutation in the thymidine kinase gene. A sub-population of the resulting cells were hybrid cells containing human chromosomes. Hamster cell line A23 cells were pipetted into a centrifuge tube containing 10 ml DMEM in which 10% fetal bovine serum (FBS)+1× Pen/Strep+10% glutamine were added, centrifuged at 1500 rpm for 5 minutes, resuspended in 5 ml of RPMI and pipetted into a tissue culture flask containing 15 ml RPMI medium. The lymphoblastoid cells were grown at 37° C. to confluence. At the same time, human lymphoblastoid cells were pipetted into a centrifuge tube containing 10 ml RPMI in which 15% FBCS+1× Pen/Strep+10% glutamine were added, centrifuged at 1500 rpm for 5 minutes, resuspended in 5 ml of RPMI and pipetted into a tissue culture flask containing 15 ml RPMI. The lymphoblastoid cells were grown at 37° C. to confluence.

To prepare the A23 hamster cells, the growth medium was aspirated and the cells were rinsed with 10 ml PBS. The cells were then trypsinized with 2 ml of trypsin, divided onto 3-5 plates of fresh medium (DMEM without HAT) and incubated at 37° C. The lymphoblastoid cells were prepared by transferring the culture into a centrifuge tube and centrifuging at 1500 rpm for 5 minutes, aspirating the growth medium, resuspending the cells in 5 ml RPMI and pipetting 1 to 3 ml of cells into 2 flasks containing 20 ml RPMI.

To achieve cell fusion, approximately 8-10×10⁶ lymphoblastoid cells were centrifuged at 1500 rpm for 5 min. The cell pellet was then rinsed with DMEM by resuspending the cells, centrifuging them again and aspirating the DMEM. The lymphoblastoid cells were then resuspended in 5 ml fresh DMEM. The recipient A23 hamster cells had been grown to confluence and split 3-4 days before the fusion and were, at this point, 50-80% confluent. The old media was removed and the cells were rinsed three times with DMEM, trypsinized, and finally suspended in 5 ml DMEM. The lymphoblastoid cells were slowly pipetted over the recipient A23 cells and the combined culture was swirled slowly before incubating at 37° C. for 1 hour. After incubation, the media was gently aspirated from the A23 cells, and 2 ml room temperature PEG 1500 was added by touching the edge of the plate with a pipette and slowly adding PEG to the plate while rotating the plate with the other hand. It took approximately one minute to add all the PEG in one full rotation of the plate. Next, 8 ml DMEM was added down the edge of the plate while rotating the plate slowly. The PEG/DMEM mixture was aspirated gently from the cells and then 8 ml DMEM was used to rinse the cells. This DMEM was removed and 10 ml fresh DMEM was added and the cells were incubated for 30 min. at 37° C. Again the DMEM was aspirated from the cells, and 10 ml DMEM supplemented with 10% FBCS and 1× Pen/Strep was added to the cells, which were then allowed to incubate overnight.

After incubation, the media was aspirated and the cells were rinsed with PBS. The cells were then trypsinized and divided among plates containing selection media (DMEM in which 10% FBS+1× Pen/Strep+1× HAT were added) so that each plate received approximately 100,000 cells. The media was changed on the third day following plating. Colonies were picked and placed into 24-well plates upon becoming visible to the naked eye (days 9-14). If a picked colony was confluent within 5 days, it was deemed healthy and the cells were trypsinized and moved to a 6-well plate.

DNA and stock hybrid cell cultures were prepared from the cells from the 6-well plate cultures. The cells were trypsinized and divided between a 100 mm plate containing 10 ml selection media and an Eppendorf tube. The cells in the tube were pelleted and resuspended in 200 μl PBS, and DNA was isolated using a Qiagen DNA mini kit at a concentration of <5 million cells per spin column. The 100 mm plate was grown to confluence, and the cells were either continued in culture or frozen.

Example 2 Selecting Haploid Hybrids

Scoring for the presence, absence and diploid/haploid state of human chromosomes in each hybrid was performed using the Affymetrix HuSNP genechip (Affymetrix, Inc., of Santa Clara, Calif., HuSNP Mapping Assay, reagent kit and user manual, Affymetrix Part No. 900194), which can score 1494 markers in a single chip hybridization. As controls, the hamster and human diploid lymphoblastoid cell lines were screened using the HuSNP chip hybridization assay. Any SNPs which were heterozygous in the parent lymphoblastoid diploid cell line were scored for haploidy in each fusion cell line. Assume that “A” and “B” are alternative variants at each SNP location. By comparing the markers that were present as “AB” heterozygous in the parent diploid cell line to the same markers present as “A” or “B” (hemizygous) in the hybrids, the human DNA strands which were in the haploid state in each hybrid line were determined.

FIG. 11 shows results after two human/hamster cell hybrids (Hybrid 1 and Hybrid 2) are tested for selected markers on human chromosome 21. The first column lists the HuSNP chip marker designations. The second column reports whether a signal was obtained when the hamster cell nucleic acid (no fusion) was used for hybridization with a HuSNP chip. As expected, there was no signal for any marker in the hamster cell sample. The third column reports which variants for each marker were detected (“A”, “B” or “AB”) in the diploid parent human lymphoblastoid cell line, CPD17. In some instances, only an A variant was present, in some instances only a B variant was present, and in some cases the CPD17 cells were heterozygous (“AB”) for the variants. The last two columns report the result when nucleic acid samples from two human/hamster hybrids (Hybrid 1 and Hybrid 2) are hybridized with the HuSNP chip. Note in cases where only A variants were present in the parent CPD17 cell line, only A variants were transferred in the fusion. In cases where only B variants were present in the parent CPD17 cell line, only B variants were transferred in the fusion. In cases where the CPD17 cell line was heterozygous, an A variant was transferred to some fusion clones, and a B variant was transferred to other fusion clones. It should be understood, however, that often only portions of chromosomes are present in the hybrid cell lines resulting from this fusion process, that some hybrids may be diploid for some human chromosomes or portions thereof, that some hybrids may be haploid for other human chromosomes or portions thereof, and some hybrids may not have either variant of some chromosomes. Hybrids containing only one variant of a particular human chromosome (for instance, chromosome 21) were selected for analysis. Even more preferably, hybrids containing a whole chromosome (as opposed to only a portion thereof) were selected for analysis.
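The comparison described in this example can be sketched in a few lines of Python. This is a minimal illustration only, not the scoring software actually used: the function name, the dictionary layout and the marker names are hypothetical, and a missing hybrid call is assumed to mean that no signal was obtained for that marker.

```python
from typing import Dict, Optional


def score_haploid_markers(
    parent_calls: Dict[str, str],
    hybrid_calls: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Classify the informative markers of one hybrid cell line.

    Only markers called "AB" (heterozygous) in the diploid parent are
    informative. For each of them the hybrid is scored as:
      "haploid" - the hybrid shows only "A" or only "B"
      "diploid" - the hybrid still shows "AB"
      "absent"  - no call was obtained in the hybrid (assumption: a
                  missing or None entry means no signal)
    """
    states = {}
    for marker, parent_call in parent_calls.items():
        if parent_call != "AB":
            continue  # homozygous in the parent, so not informative
        hybrid_call = hybrid_calls.get(marker)
        if hybrid_call in ("A", "B"):
            states[marker] = "haploid"
        elif hybrid_call == "AB":
            states[marker] = "diploid"
        else:
            states[marker] = "absent"
    return states


# Hypothetical marker names and calls, for illustration only.
parent = {"m1": "AB", "m2": "A", "m3": "AB", "m4": "AB"}
hybrid = {"m1": "A", "m2": "A", "m3": "AB"}
states = score_haploid_markers(parent, hybrid)
```

Under these assumptions, a hybrid would be a candidate for analysis of a given chromosome when all of its informative markers on that chromosome are scored "haploid".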

Example 3 Long Range PCR

DNA from the hamster/human cell hybrids was used to perform long-range PCR assays. Long range PCR assays are known generally in the art and have been described, for example, in the standard long range PCR protocol from the Boehringer Mannheim Expand Long Range PCR Kit, incorporated herein by reference for all purposes.

Primers used for the amplification reactions were designed in the following way: a given sequence, for example the 23 megabase contig on chromosome 21, was entered into a software program known in the art herein called “repeat masker” which recognizes sequences that are repeated in the genome (e.g., Alu and LINE elements) (see, A. F. A. Smit and P. Green, www.genome.washington.edu/uwgc/analysistools/repeatmask, incorporated herein by reference). The repeated sequences were “masked” by the program by substituting each specific nucleotide of the repeated sequence (A, T, G or C) with “N”. The sequence output after this repeat mask substitution was then fed into a commercially available primer design program (Oligo 6.23) to select primers that were greater than 30 nucleotides in length and had melting temperatures of over 65° C. The designed primer output from Oligo 6.23 was then fed into a program which then “chose” primer pairs which would PCR amplify a given region of the genome but have minimal overlap with the adjacent PCR products. The success rate for long range PCR using commercially available protocols and this primer design was at least 80%, and greater than 95% success was achieved on some portions of human chromosomes.
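The masking and primer-filtering steps above can be illustrated with a short Python sketch. It does not reproduce the repeat masker or Oligo 6.23 programs: the function names and the interval representation are hypothetical, the melting temperature is taken as an input rather than computed, and rejecting primers that contain a masked “N” is an assumption not stated in the text.

```python
from typing import Iterable, Tuple


def mask_repeats(sequence: str, repeat_intervals: Iterable[Tuple[int, int]]) -> str:
    """Replace every base inside a repeat interval with "N".

    Intervals are assumed to be 0-based and half-open: (start, end).
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


def passes_primer_filters(
    primer: str,
    tm_celsius: float,
    min_length: int = 30,
    min_tm: float = 65.0,
) -> bool:
    """Apply the thresholds given in the text: length greater than 30
    nucleotides and melting temperature over 65 degrees C.

    Assumption: a primer overlapping masked sequence (any "N") is rejected.
    """
    return len(primer) > min_length and tm_celsius > min_tm and "N" not in primer


masked_example = mask_repeats("ACGTACGTAC", [(2, 5)])
```

In this toy call, positions 2 through 4 of the ten-base sequence are masked; any melting temperature estimate could be supplied to `passes_primer_filters` by the caller.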

An illustrative protocol for long range PCR uses the Expand Long Template PCR System from Boehringer Mannheim Cat. #1681 834, 1681 842, or 1759 060. In the procedure each 50 μL PCR reaction requires two master mixes. In a specific example, Master Mix 1 was prepared for each reaction in 1.5 ml microfuge tubes on ice and includes a final volume of 19 μL of Molecular Biology Grade Water (Bio Whittaker, Cat. #16-001Y); 2.5 μL 10 mM dNTP set containing dATP, dCTP, dGTP, and dTTP at 10 mM each (Life Technologies Cat. #10297-018) for a final concentration of 400 μM of each dNTP; and 50 ng DNA template.

Master Mix 2 for all reactions was prepared and kept on ice. For each PCR reaction Master Mix 2 includes a final volume of 25 μL of Molecular Biology Grade Water (Bio Whittaker); 5 μL 10× PCR buffer 3 containing 22.50 mM MgCl₂ (Sigma, Cat. #M 10289); 2.5 μL 10 mM MgCl₂ (for a final MgCl₂ concentration of 2.75 mM); and 0.75 μL enzyme mix (added last).
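The stated final MgCl₂ concentration can be checked with simple dilution arithmetic. The sketch below assumes the 50 μL reaction volume given for this protocol and treats the two MgCl₂ contributions as additive; the helper name is hypothetical.

```python
def final_concentration_mM(stock_mM: float, volume_uL: float, total_uL: float) -> float:
    """Dilution formula C_final = C_stock * V_added / V_total."""
    return stock_mM * volume_uL / total_uL


REACTION_VOLUME_UL = 50.0  # each PCR reaction is 50 uL

# 5 uL of 10x PCR buffer 3, which contains 22.5 mM MgCl2
mg_from_buffer = final_concentration_mM(22.5, 5.0, REACTION_VOLUME_UL)

# 2.5 uL of supplementary 10 mM MgCl2
mg_from_supplement = final_concentration_mM(10.0, 2.5, REACTION_VOLUME_UL)

mg_total = mg_from_buffer + mg_from_supplement
```

The buffer contributes 2.25 mM and the supplement 0.5 mM, which together give the 2.75 mM stated above.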

Six microliters of premixed primers (containing 2.5 μL of Master Mix 1) were added to appropriate tubes, then 25 μL of Master Mix 2 was added to each tube. The tubes were capped, mixed, centrifuged briefly and returned to ice. At this point, the PCR cycling was begun according to the following program: step 1: 94° C. for 3 min to denature template; step 2: 94° C. for 30 sec; step 3: annealing for 30 sec at a temperature appropriate for the primers used; step 4: elongation at 68° C. for 1 min/kb of product; step 5: repetition of steps 2-4 38 times for a total of 39 cycles; step 6: 94° C. for 30 sec; step 7: annealing for 30 sec; step 8: elongation at 68° C. for 1 min/kb of product plus 5 additional minutes; and step 9: hold at 4° C. Alternatively, a two-step PCR would be performed: step 1: 94° C. for 3 min to denature template; step 2: 94° C. for 30 sec; step 3: annealing and elongation at 68° C. for 1 min/kb of product; step 4: repetition of steps 2-3 38 times for a total of 39 cycles; step 5: 94° C. for 30 sec; step 6: annealing and elongation at 68° C. for 1 min/kb of product plus 5 additional minutes; and step 7: hold at 4° C.

Results of the long range PCR amplification reaction for various regions on human chromosomes 14 and 22 were visualized on ethidium bromide-stained agarose gels (FIG. 12). The long range PCR amplification methods of the present invention routinely produced amplified fragments having an average size of about 8 kb, and appeared to fail to amplify genomic regions in only rare cases (see G11 on the chromosome 22 gel).

Example 4 Wafer Design, Manufacture, Hybridization and Scanning

The set of oligonucleotide probes to be contained on an oligonucleotide array (chip or wafer) was defined based on the human DNA strand sequence to be queried. The oligonucleotide sequences were based on consensus sequences reported in publicly available databases. Once the probe sequences were defined, computer algorithms were used to design photolithographic masks for use in manufacturing the probe-containing arrays. Arrays were manufactured by a light-directed chemical synthesis process which combines solid-phase chemical synthesis with photolithographic fabrication techniques. See, for example, WO 92/10092, or U.S. Pat. Nos. 5,143,854; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,445,934; 5,744,305; 5,800,992; 6,040,138; 6,040,193, all of which are incorporated herein by reference in their entireties for all purposes. Using a series of photolithographic masks to define exposure sites on the glass substrate (wafer) followed by specific chemical synthesis steps, the process constructed high-density areas of oligonucleotide probes on the array, with each probe in a predefined position. Multiple probe regions were synthesized simultaneously and in parallel.

The synthesis process involved selectively illuminating a photo-protected glass substrate by passing light through a photolithographic mask wherein chemical groups in unprotected areas were activated by the light. The selectively-activated substrate wafers were then incubated with a chosen nucleoside, and chemical coupling occurred at the activated positions on the wafer. Once coupling took place, a new mask pattern was applied and the coupling step was repeated with another chosen nucleoside. This process was repeated until the desired set of probes was obtained. In one specific example, 25-mer oligonucleotide probes were used, where the thirteenth base was the base to be queried. Four probes were used to interrogate each nucleotide present in each sequence: one probe complementary to the sequence and three mismatch probes identical to the complementary probe except for the thirteenth base. In some cases, at least 10×10⁶ probes were present on each array.
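The four-probe interrogation scheme for a single base can be sketched as follows. This is an illustrative Python sketch, not the array design software: the function names are hypothetical, positions are 0-based, and the probes are written as the reverse complement of the reference window, which is one possible strand convention.

```python
from typing import Dict, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def probes_for_base(reference: str, pos: int, length: int = 25) -> Tuple[str, Dict[str, str]]:
    """Build the four probes that interrogate reference[pos].

    The queried base sits at the central (thirteenth) position of each
    25-mer. One probe is perfectly complementary to the reference; the
    other three differ from it only at the central base.
    Returns (perfect_match_probe, {central_base: probe}).
    """
    half = length // 2
    if pos < half or pos + half >= len(reference):
        raise ValueError("position too close to the end of the reference")
    window = reference[pos - half : pos + half + 1]
    perfect = reverse_complement(window)
    probes = {
        base: perfect[:half] + base + perfect[half + 1 :]
        for base in "ACGT"
    }
    return perfect, probes


# Toy reference, for illustration only.
reference = "ACGT" * 10
perfect, probes = probes_for_base(reference, 20)
```

Because the window has odd length, the central base remains central after reverse complementing, so the substituted position corresponds to the queried base on either strand.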

Once fabricated, the arrays were hybridized to the products from the long range PCR reactions performed on the hamster-human cell hybrids. The samples to be analyzed were labeled and incubated with the arrays to allow hybridization of the sample to the probes on the wafer.

After hybridization, the array was inserted into a confocal, high performance scanner, where patterns of hybridization were detected. The hybridization data were collected as light emitted from fluorescent reporter groups already incorporated into the PCR products of the sample, which was bound to the probes. Sequences present in the sample that are complementary to probes on the wafer hybridized to the wafer more strongly and produced stronger signals than those sequences that had mismatches. Since the sequence and position of each probe on the array were known, by complementarity, the identity of the variation in the sample nucleic acid applied to the probe array was identified. Scanners and scanning techniques used in the present invention are known to those skilled in the art and are disclosed in, e.g., U.S. Pat. No. 5,981,956 drawn to microarray chips, U.S. Pat. No. 6,262,838 and U.S. Pat. No. 5,459,325. In addition, U.S. Ser. No. 60/223,278 filed on Aug. 3, 2000, and non-provisional application claiming priority to U.S. Ser. No. 60/223,278 filed on Aug. 3, 2001, drawn to scanners and techniques for whole wafer scanning, are also incorporated herein by reference in their entireties for all purposes.

Example 5 Determination of SNP Haplotypes on Human Chromosome 21

Twenty independent copies of chromosome 21, representing African, Asian, and Caucasian chromosomes, were analyzed for SNP discovery and haplotype structure. Two copies of chromosome 21 from each individual were physically separated using a rodent-human somatic cell hybrid technique (FIG. 10), discussed supra. The reference sequence for the analysis consisted of human chromosome 21 genomic DNA sequence consisting of 32,397,439 bases. This reference sequence was masked for repetitive sequences and the resulting 21,676,868 bases (67%) of unique sequence were assayed for variation with high density oligonucleotide arrays. Eight unique oligonucleotides, each 25 bases in length, were used to interrogate each of the unique sample chromosome 21 bases, for a total of 1.7×10⁸ different oligonucleotides. These oligonucleotides were distributed over a total of eight different wafer designs using a previously described tiling strategy (Chee, et al., Science 274:610 (1996)). Light-directed chemical synthesis of oligonucleotides was carried out on 5 inch × 5 inch glass wafers purchased from Affymetrix, Inc. (Santa Clara, Calif.).

Unique oligonucleotides were designed to generate 3253 minimally overlapping long range PCR (LRPCR) products of 10 kb average length spanning 32.4 Mb of contiguous chromosome 21 DNA, and were prepared as described supra. For each wafer hybridization, corresponding LRPCR products were pooled and were purified using Qiagen tip 500 (Qiagen). A total of 280 μg of purified DNA was fragmented using 37 μl of 10× One-Phor-All buffer PLUS (Promega) and 1 unit of DNase (Life Technologies/Invitrogen) in 370 μl total volume at 37° C. for 10 min followed by heat inactivation at 99° C. for 10 min. The fragmented products were end labeled using 500 units of TdT (Boehringer Mannheim) and 20 nmoles of biotin-N6-ddATP (DuPont NEN) at 37° C. for 90 min and heat inactivated at 95° C. for 10 min. The labeled samples were hybridized to the wafers in 10 mM Tris-HCl (pH 8), 3 M tetramethylammonium chloride, 0.01% Tx-100, 10 μg/ml denatured herring sperm DNA in a total volume of 14 ml per wafer at 50° C. for 14-16 hours. The wafers were rinsed briefly in 4× SSPE, washed three times in 6× SSPE for 10 min each, and stained using streptavidin R-phycoerythrin (SAPE, 5 ng/ml) at room temp for 10 min. The signal was amplified by staining with an antibody against streptavidin (1.25 ng/ml) and by repeating the staining step with SAPE.

PCR products corresponding to the bases present on a single wafer were pooled and hybridized to the wafer as a single reaction. In total, 3.4×10⁹ oligonucleotides were synthesized on 160 wafers to scan 20 independent copies of human chromosome 21 for DNA sequence variation. Each unique chromosome 21 was amplified from a rodent-human hybrid cell line by using long range PCR. LRPCR assays were designed using Oligo 6.23 primer design software with high-moderate stringency parameters. The resulting primers were typically 30 nucleotides in length with a melting temperature of >65° C. The range of amplicon size was from 3 kb to 14 kb. A primer database for the entire chromosome was generated and software (pPicker) was utilized to choose a minimal set of non-redundant primers that yield maximum coverage of chromosome 21 sequence with a minimal overlap between adjacent amplicons. Alternatively, the primer selection method described in Example 3, herein, was employed. LRPCR reactions were performed using the Expand Long Template PCR Kit (Boehringer Mannheim) with minor modifications. The wafers were scanned using a custom built confocal scanner.

SNPs were detected as altered hybridization by using a pattern recognition algorithm. A combination of previously described algorithms (Wang, et al., Science 280:1077 (1998)) was used to detect SNPs based on altered hybridization patterns. In total, 35,989 SNPs were identified in the sample of twenty chromosomes. The position and sequence of these human polymorphisms have been deposited in GenBank's SNPdb. Dideoxy sequencing was used to assess a random sample of 227 of these SNPs in the original DNA samples, confirming 220 (97%) of the SNPs assayed. In order to achieve this low rate of 3% false positive SNPs, stringent thresholds were required for SNP detection on wafers that resulted in a high false negative rate. Approximately 65% of all bases present on the wafers yielded data of high enough quality for use in SNP detection with 35% being discarded as being false negatives. Consistent failure of long range PCR in all samples analyzed accounts for 15% of the 35% false negative rate. The remaining 20% false negatives are distributed between bases that never yield high quality data (10%) and bases that yield high quality data in only a fraction of the 20 chromosomes analyzed (10%). In general, it is the sequence context of a base that dictates whether or not it will yield high quality data. The finding that approximately 20% of all bases give consistently poor data is very similar to the finding that approximately 30% of bases in single dideoxy sequencing reads of 500 bases have quality scores too low for reliable SNP detection (Altshuler, et al., Nature 407:513 (2000)). The power to discover rare SNPs as compared to more frequent SNPs is disproportionately reduced in cases where only a limited number of the samples analyzed yield high quality data for a given base. As a result, SNP discovery by this method is biased in favor of common SNPs.

FIG. 13A shows the distribution of minor allele frequencies of all 35,989 SNPs discovered in the sample of globally diverse chromosomes. Genetic variation, normalized for the number of chromosomes in the sample, was estimated with two measures of nucleotide diversity: π, the average heterozygosity per site, and θ, the population mutation parameter (see Hartl and Clark, Principles of Population Genetics (Sinauer, Mass., 1997)). The 32,397,439 bases of finished genomic chromosome 21 DNA were divided into 200,000 base pair segments, and the high-quality base pairs used for SNP discovery in each segment were examined. The observed heterozygosity of these bases was used to calculate an average nucleotide diversity (π) for each segment. The estimates of average nucleotide diversity for the total data set (π=0.000723 and θ=0.000798), as well as the distribution of nucleotide diversity measured in contiguous 200,000 base pair bins of chromosome 21 (FIG. 13B), are within the range of values previously described (The International SNP Map Working Group, Nature 409:928-33 (2001)).
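The two diversity measures named above can be illustrated with a short sketch. It assumes the standard textbook estimators (Watterson's estimator for θ and the average number of pairwise differences for π); the specification does not state which exact formulas were applied, and it does not say how bases with missing data in some chromosomes were weighted, so this sketch treats every high-quality base as observed in all chromosomes. The function names and toy numbers are hypothetical.

```python
from typing import Sequence


def harmonic_number(n_chrom: int) -> float:
    """Return a_n = 1 + 1/2 + ... + 1/(n-1) for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n_chrom))


def watterson_theta(n_segregating: int, n_chrom: int, n_bases: int) -> float:
    """Per-site theta estimated from the number of segregating sites."""
    return n_segregating / (harmonic_number(n_chrom) * n_bases)


def nucleotide_pi(minor_counts: Sequence[int], n_chrom: int, n_bases: int) -> float:
    """Per-site pi: average number of pairwise differences per base."""
    n_pairs = n_chrom * (n_chrom - 1) / 2
    # A SNP whose minor allele sits on k of n chromosomes differs in k*(n-k) pairs.
    differing_pairs = sum(k * (n_chrom - k) for k in minor_counts)
    return differing_pairs / n_pairs / n_bases


# Toy input (hypothetical): 4 chromosomes, 100 high-quality bases, 2 SNPs.
toy_minor_counts = [1, 2]
toy_pi = nucleotide_pi(toy_minor_counts, n_chrom=4, n_bases=100)
toy_theta = watterson_theta(len(toy_minor_counts), n_chrom=4, n_bases=100)
```

Applied per 200,000 base pair segment, the same two functions would yield the kind of per-bin comparison of θ and π discussed in the text, again under the stated assumptions.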

The extent of overlap of 15,549 chromosome 21 SNPs discovered by The SNP Consortium (TSC) was compared with the SNPs found in this study. Of the TSC SNPs, 5,087 were found to be in repeated DNA and were not tiled on the wafers. Of the remaining 10,462 TSC SNPs, 4705 (45%) were identified. The estimate of θ was observed to be greater than the estimate of π for 129 of the 162 200-kb bins of contiguous DNA sequence analyzed. This difference is consistent with a recent expansion of the human population and is similar to the finding of a recent study of nucleotide diversity in human genes (Stephens, et al., Science 293:489 (2001)). It was found that 11,603 of the SNPs (32%) had a minor allele observed a single time in the sample (singletons), as compared with the neutral model expectation of 43% singletons given the observed amount of nucleotide diversity (Fu and Li, Genetics 133:693 (1993)). The difference between the observed and expected values is likely attributable to the reduced power to identify rare as compared to common SNPs in this study as discussed above.

Overall, 47% of the 53,000 common SNPs with an allele frequency of 10% or greater estimated to be present in 32.4 Mb of the human genome were identified. This compares with an estimate of 18-20% of all such common SNPs present in the collection generated by the International SNP Mapping Working Group and the SNP Consortium. The difference in coverage is explained by the fact that the present study used larger numbers of chromosomes for SNP discovery. To assess the replicability of the findings, SNP discovery was performed for one wafer design with nineteen additional copies of chromosome 21 derived from the same diversity panel as the original set of samples. A total of 7188 SNPs were identified using the two sets of samples. On average, 66% of all SNPs found in one set of samples were discovered in the second set, consistent with previous findings (Marth, et al., Nature Genet. 27:371 (2001) and Yang, et al., Nature Genet. 26:13 (2000)). As expected, failure of a SNP to replicate in a second set of samples is strongly dependent on allele frequency. It was found that 80% of SNPs with a minor allele present two or more times in a set of samples were also found in a second set of samples, while only 32% of SNPs with a minor allele present a single time were found in a second set of samples. These findings suggest that the 24,047 SNPs in the collection with a minor allele represented more than once are highly replicable in different global samples and that this set of SNPs is useful for defining common global haplotypes. In the course of SNP discovery, 339 SNPs which appeared to have more than two alleles were identified. These SNPs were not included in the present analysis.

In addition to the replicability of SNPs in different samples, the distance between consecutive SNPs in a collection of SNPs is critical for defining meaningful haplotype structure. Haplotype blocks, which can be as short as several kb, may go unrecognized if the distance between consecutive SNPs in a collection is large relative to the size of the actual haplotype blocks. The collection of SNPs in this study was very evenly distributed across the chromosome, even though repeat sequences were not included in the SNP discovery process. FIG. 13C shows the distribution of SNP coverage across 32,397,439 bases of finished chromosome 21 DNA sequence. An interval is the distance between consecutive SNPs. There are a total of 35,988 intervals for the entire SNP set and a total of 24,046 intervals for the common SNP set (i.e., SNPs with a minor allele present more than once in the sample). The average distance between consecutive SNPs was 900 bases when all SNPs were considered, and 1300 bases when only the 24,047 common SNPs were considered. For this set of common SNPs, 93% of intervals between consecutive SNPs in genomic DNA, including repeated DNA, were 4000 bases or less (again, see FIG. 13C).

The construction of haplotype blocks or patterns from diploid data is complicated by the fact that the relationship between alleles for any two heterozygous SNPs is not directly observable. Consider an individual with two copies of chromosome 21 and two alleles, A and G, at one chromosome 21 SNP, as well as two alleles, A and G, at a second chromosome 21 SNP. In such a case, it is unclear whether one copy of chromosome 21 contains allele A at the first SNP and allele A at the second SNP, while the other copy of chromosome 21 contains allele G at the first SNP and allele G at the second SNP, or whether one copy of chromosome 21 contains allele A at the first SNP and allele G at the second SNP, while the other copy of chromosome 21 contains allele G at the first SNP and allele A at the second SNP. Current methods used to circumvent this problem include statistical estimation of haplotype frequencies, direct inference from family data, and allele-specific PCR amplification over short segments.
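The phase ambiguity described above can be made concrete with a small enumeration. The following is only an illustrative sketch: the function name and the representation of a diploid genotype as a list of unordered allele pairs are assumptions made for this example, not part of the described method.

```python
from itertools import product


def possible_phasings(genotype):
    """Enumerate the haplotype pairs consistent with an unphased diploid genotype.

    `genotype` is a list of two-allele tuples, one tuple per SNP. Each result is
    a frozenset holding the two haplotype strings (one string if both copies
    are identical).
    """
    phasings = set()
    for choice in product((0, 1), repeat=len(genotype)):
        # Pick one allele per SNP for the first copy; the other copy gets the rest.
        copy_one = "".join(alleles[c] for alleles, c in zip(genotype, choice))
        copy_two = "".join(alleles[1 - c] for alleles, c in zip(genotype, choice))
        phasings.add(frozenset((copy_one, copy_two)))
    return phasings


# The two-SNP case from the text: heterozygous A/G at both SNPs.
two_het_sites = possible_phasings([("A", "G"), ("A", "G")])
```

For the two heterozygous SNPs in the text, the enumeration yields exactly the two alternatives described (A-A with G-G, or A-G with G-A). With more heterozygous SNPs the number of alternatives doubles with each additional site, which is why directly observed haploid data is attractive.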

To avoid these complexities, SNPs were characterized on haploid copies of chromosome 21 isolated in rodent-human somatic cell hybrids, allowing direct determination of the full haplotypes of these chromosomes. The set of 24,047 SNPs with a minor allele represented more than once in the data set was used to define the haplotype structure. FIG. 14 shows the haplotype patterns for twenty independent globally diverse chromosomes defined by 147 common human chromosome 21 SNPs. The 147 SNPs span 106 kb of genomic DNA sequence. Each row of colored boxes represents a single SNP. The black boxes in each row represent the major allele for that SNP, and the white boxes represent the minor allele. Absence of a box at any position in a row indicates missing data. Each column of colored boxes represents a single chromosome, with the SNPs arranged in their physical order on the chromosome. Invariant bases between consecutive SNPs are not represented in the figure. The 147 SNPs are divided into eighteen blocks, defined by black horizontal lines. The position of the base in chromosome 21 genomic DNA sequence defining the beginning of one block and the end of the adjacent block is indicated by the numbers to the left of the vertical black line. The expanded boxes on the right of the figure represent a SNP block defined by 26 common SNPs spanning 19 kb of genomic DNA. Of the seven different haplotype patterns represented in the sample, the four most common patterns include sixteen of the twenty chromosomes sampled (i.e., 80% of the sample). The black and white circles indicate the allele patterns of two informative SNPs, which unambiguously distinguish between the four common haplotypes in this block. Although no two chromosomes shared an identical haplotype pattern for these 147 SNPs, there are numerous regions in which multiple chromosomes shared a common pattern.
One such region, defined by 26 SNPs spanning 19 kb, is expanded for more detailed analysis (again, see the enlarged region of FIG. 14). This block defines seven unique haplotype patterns in 20 chromosomes. Although some data are missing due to failure to pass the threshold for data quality, in all cases a given chromosome can be assigned unambiguously to one of the seven haplotypes. The four most frequent haplotypes, each of which is represented by three or more chromosomes, account for 80% of all chromosomes in the sample. Only two “informative” SNPs out of the total of twenty-six are required to distinguish the four most frequent haplotypes from one another. In this example, four chromosomes with infrequent haplotypes would be incorrectly classified as common haplotypes by using information from only these two informative SNPs. Nevertheless, it is remarkable that 80% of the haplotype structure of the entire global sample is defined by fewer than 10% of the total SNPs in the block. Several different possibilities exist in which three informative SNPs can be chosen so that each of the four common haplotypes is defined uniquely by a single SNP. One of these “three SNP” choices would be preferred over the two SNP combination in an experiment involving genotyping of pooled samples, since the two SNP combination would not permit determination of frequencies of the four common haplotypes in such a situation; thus, the present invention provides a dramatic improvement over the random selection method of SNP mapping.
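The idea that a small subset of "informative" SNPs can distinguish the common haplotypes in a block can be sketched as a search over SNP subsets. This is a brute-force illustration only; the specification does not state how informative SNPs were selected in practice, and the function name and the toy haplotype strings (0 for the major allele, 1 for the minor allele) are hypothetical.

```python
from itertools import combinations


def minimal_informative_snps(haplotypes):
    """Return the smallest tuple of SNP indices whose alleles tell all the
    given (distinct) haplotype patterns apart, or None if no subset does."""
    n_snps = len(haplotypes[0])
    for size in range(1, n_snps + 1):
        for subset in combinations(range(n_snps), size):
            signatures = {tuple(h[i] for i in subset) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return subset
    return None


# Four hypothetical common haplotype patterns over six SNPs.
toy_common_haplotypes = ["000000", "110100", "001011", "111111"]
toy_informative = minimal_informative_snps(toy_common_haplotypes)
```

Because each biallelic SNP contributes at most one bit, four patterns can never be separated by fewer than two SNPs, which matches the two informative SNPs reported for the 26-SNP block. The exhaustive search grows quickly with block size, so a practical implementation would likely need a more economical strategy.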

In summary, while the particular application may dictate the selection of informative SNPs to capture haplotype information, it is clear that the majority of the haplotype information in the sample is contained in a very small subset of all the SNPs. It is also clear that random selection of two or three informative SNPs from this block of SNPs will often not provide enough information to uniquely assign a chromosome to one of the four common haplotypes.

One issue is how to define a set of contiguous blocks of SNPs spanning the entire 32.4 Mb of chromosome 21 while minimizing the total number of SNPs required to define the haplotype structure. In one embodiment, an optimization algorithm based on a “greedy” strategy was used to address this problem. All possible blocks of physically consecutive SNPs of size one SNP or larger were considered. Ambiguous haplotype patterns were treated as missing data and were not included when calculating percent coverage. Considering the remaining overlapping blocks simultaneously, the block with the maximum ratio of total SNPs in the block to the minimal number of SNPs required to uniquely discriminate haplotypes represented more than once in the block was selected. Any of the remaining blocks that physically overlapped with the selected block were discarded, and the process was repeated until a set of contiguous, non-overlapping blocks that cover the 32.4 Mb of chromosome 21 with no gaps, and with every SNP assigned to a block, was selected. Given the sample size of twenty chromosomes, the algorithm produces a maximum of ten common haplotype patterns per block, each represented by two independent chromosomes.
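The greedy strategy just described can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the specification: there is no missing or ambiguous data, a multi-SNP candidate block is admitted only if its common patterns (seen at least twice) cover a chosen fraction of chromosomes (0.8 here), single-SNP blocks are always admitted so that every SNP can be assigned, and ties are broken by position. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def common_patterns(haps, start, end, min_count=2):
    """Count haplotype patterns in SNPs [start, end) seen at least min_count times."""
    counts = Counter(h[start:end] for h in haps)
    return {p: c for p, c in counts.items() if c >= min_count}


def min_distinguishing(patterns):
    """Smallest number of SNPs that tells all given patterns apart (at least 1)."""
    patterns = list(patterns)
    if len(patterns) <= 1:
        return 1
    width = len(patterns[0])
    for size in range(1, width + 1):
        for subset in combinations(range(width), size):
            if len({tuple(p[i] for i in subset) for p in patterns}) == len(patterns):
                return size
    return width


def greedy_blocks(haps, coverage=0.8):
    """Partition SNP indices into contiguous, non-overlapping blocks."""
    n_snps = len(haps[0])
    candidates = []
    for start in range(n_snps):
        for end in range(start + 1, n_snps + 1):
            common = common_patterns(haps, start, end)
            covered = sum(common.values()) / len(haps)
            if end - start > 1 and covered < coverage:
                continue  # multi-SNP block fails the coverage requirement
            score = (end - start) / min_distinguishing(common)
            candidates.append((score, start, end))
    chosen = []
    while candidates:
        # Take the block with the best ratio, then drop everything overlapping it.
        best = max(candidates, key=lambda c: (c[0], -c[1]))
        chosen.append((best[1], best[2]))
        candidates = [c for c in candidates if c[2] <= best[1] or c[1] >= best[2]]
    return sorted(chosen)


# Ten hypothetical chromosomes: SNPs 0-3 vary together, as do SNPs 4-5.
toy_haps = ["000000"] * 3 + ["000011"] * 2 + ["111100"] * 2 + ["111111"] * 3
toy_blocks = greedy_blocks(toy_haps)
```

In the toy data the four-SNP block scores 4 (four SNPs, one needed to discriminate its two patterns) and is selected first; the remaining two SNPs then form the second block. Enumerating every candidate block is quadratic in the number of SNPs, so a chromosome-scale run would presumably bound the maximum block length.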

Applying this algorithm to the data set of 24,047 common SNPs, 4135 blocks of SNPs spanning chromosome 21 were defined. A total of 589 blocks, comprising 14% of all blocks, contain more than ten SNPs per block and include 44% of the total 32.4 Mb. In contrast, 2138 blocks, comprising 52% of all blocks, contain fewer than three SNPs per block and make up only 20% of the physical length of the chromosome. The largest block contains 114 common SNPs and spans 115 kb of genomic DNA. Overall, the average physical size of a block is 7.8 kb. The size of a block is not correlated with its order on the chromosome, and large blocks are interspersed with small blocks along the length of the chromosome. There are an average of 2.7 common haplotype patterns per block, defined as haplotype patterns that are observed on multiple chromosomes. On average, the most frequent haplotype pattern in a block is represented by 9.6 chromosomes out of the twenty chromosomes in the sample, the second most frequent haplotype pattern is represented by 4.2 chromosomes, and the third most frequent haplotype pattern, if present, is represented by 2.1 chromosomes. The fact that such a large fraction of globally diverse chromosomes are represented by such limited haplotype diversity is remarkable. The findings are consistent with the observation that when haplotype pattern frequency is considered, 82% of the haplotype patterns observed in a collection of 313 human genes are observed in all ethnic groups, while only 8% of haplotypes are population specific (Stephens, et al., Science 293:489-93 (2001)).

Several experiments were performed to measure the influence of parameters of the haplotype algorithm on the resulting block patterns. The fraction of chromosomes required to be covered by common haplotypes was varied, from an initial 80%, to 70% and 90%. As would be expected, requiring more complete coverage results in somewhat larger numbers of shorter blocks. Using only the 16,503 SNPs with a minor allele frequency of at least 20% in the sample resulted in somewhat longer blocks, but the numbers of SNPs per block did not change significantly. For one region of about 3 Mb, a deeper sample of 38 chromosomes for SNPs and common haplotype blocks with at least 10% frequency was analyzed, so as to be comparable with the 20 chromosome analysis. The resulting distribution of block sizes closely matched the initial results. Also, a randomization test was performed in which the non-ambiguous alleles at each SNP were permuted, and then used for haplotype block discovery. In this analysis, 94% of blocks contained fewer than three SNPs, and only one block contained more than five SNPs. This confirms that the larger blocks seen in the data cannot be produced by chance associations or as artifacts of the block selection methods of the present invention.
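The randomization test mentioned above, in which alleles are permuted at each SNP before block discovery, might be set up as in the sketch below. It assumes haplotypes are stored as equal-length strings with one character per SNP and that ambiguous calls are marked with a placeholder character (here "N") that stays in place; both conventions, and the function name, are assumptions made for illustration.

```python
import random


def permute_alleles(haps, missing="N", seed=0):
    """Shuffle the observed alleles of each SNP independently across chromosomes.

    Per-SNP allele counts and the positions of missing calls are preserved,
    while any association between neighbouring SNPs is destroyed.
    """
    rng = random.Random(seed)
    permuted_columns = []
    for column in zip(*haps):
        observed = [allele for allele in column if allele != missing]
        rng.shuffle(observed)
        shuffled = iter(observed)
        permuted_columns.append(
            [allele if allele == missing else next(shuffled) for allele in column]
        )
    return ["".join(row) for row in zip(*permuted_columns)]


# Hypothetical input: five chromosomes, four SNPs, one ambiguous call.
toy_observed = ["0000", "0011", "11N1", "1100", "0011"]
toy_permuted = permute_alleles(toy_observed, seed=1)
```

Running block discovery on the permuted data and comparing the block-size distribution with that of the real data gives the kind of null comparison reported in the text.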

In an effort to determine if genes were proportionately represented in both large and small blocks, a determination was made of the number of exonic bases in blocks containing more than 10 SNPs, 3 to 10 SNPs, and fewer than 3 SNPs. Exonic bases are somewhat over-represented as compared to total bases in blocks containing 3 to 10 SNPs (p<0.05 as determined by a permutation test).

Based on knowledge of the haplotype structure within blocks, subsets of the 24,047 common SNPs can be selected to capture any desired fraction of the common haplotype information, defined as complete information for haplotypes present more than once and including greater than 80% of the sample across the entire 32.4 Mb. FIG. 15 shows the number of SNPs required to capture the common haplotype information for 32.4 Mb of chromosome 21. For each SNP block, the minimum number of SNPs required to unambiguously distinguish haplotypes in that block that are present more than once (i.e., common haplotype information) was determined. These SNPs provide common haplotype information for the fraction of the total physical distance defined by that block. Beginning with the SNPs that provide common haplotype information for the greatest physical distance, the cumulative increase in physical coverage (i.e., fraction covered) is plotted relative to the number of SNPs added (i.e., SNPs required). Genic DNA includes all genomic DNA beginning 10 kb 5′ of the first exon of each known chromosome 21 gene and extending 10 kb 3′ of the last exon of that gene. For example, while a minimum of 4563 SNPs are required to capture all the common haplotype information, only 2793 SNPs are required to capture the common haplotype information in blocks containing three or more SNPs that cover 81% of the 32.4 Mb. A total of 1794 SNPs are required to capture all the common haplotype information in genic DNA, representing approximately two hundred and twenty distinct genes.
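The cumulative coverage curve described for FIG. 15 can be sketched as below. The ordering rule is an interpretation: blocks are ranked here by physical length per required SNP, which is one reading of "beginning with the SNPs that provide common haplotype information for the greatest physical distance". The function name and the block values are hypothetical.

```python
def coverage_curve(blocks, total_bp):
    """Return (SNPs required so far, fraction of sequence covered) per block added.

    `blocks` is a list of (block_length_bp, snps_required) tuples.
    """
    # Add first the blocks whose required SNPs each account for the most sequence.
    ordered = sorted(blocks, key=lambda b: b[0] / b[1], reverse=True)
    curve = []
    snps_so_far = 0
    bp_so_far = 0
    for length_bp, snps_required in ordered:
        snps_so_far += snps_required
        bp_so_far += length_bp
        curve.append((snps_so_far, bp_so_far / total_bp))
    return curve


# Three hypothetical blocks covering a 10 kb region.
toy_curve = coverage_curve([(1000, 1), (8000, 2), (1000, 2)], total_bp=10000)
```

In the toy case two SNPs already cover 80% of the region, while the last 10% costs two more; the shape mirrors the observation in the text that a modest subset of SNPs captures most of the common haplotype information.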

The present invention has particular relevance for whole-genome association studies mapping phenotypes such as common disease genes. This approach relies on the hypothesis that common genetic variants are responsible for susceptibility to common diseases (Risch and Merikangas, Science 273:1516 (1996), Lander, Science 274:536 (1996)). By comparing the frequency of genetic variants in unrelated cases and controls, genetic association studies can identify specific haplotypes in the human genome that play important roles in disease. While this approach has been used to successfully associate single candidate genes with disease (Altshuler, et al., Nature Genet. 26:76 (2000)), the recent availability of the human DNA sequence offers the possibility of surveying the entire genome, dramatically increasing the power of genetic association analysis (Kruglyak, Nature Genet. 22:139 (1999)). A major limitation to the implementation of this method has been lack of knowledge of the haplotype structure of the human genome, which is required in order to select the appropriate genetic variants for analysis. The present invention demonstrates that high-density oligonucleotide arrays in combination with somatic cell genetic sample preparation provide a high-resolution approach to empirically define the common haplotype structure of the human genome.

Although the length of genomic regions with a simple haplotype structure is extremely variable, a dense set of common SNPs enables a systematic approach to defining blocks of the human genome in which 80% of the global human population is described by only three common haplotypes. In general, when applying the particular algorithm used in this embodiment, the most common haplotype in any block is found in 50% of individuals, the second most common in 25% of individuals, and the third most common in 12.5% of individuals. It is important to note that blocks are defined based on their genetic information content and not on knowledge of how this information originated or why it exists. As such, blocks do not have absolute boundaries, and may be defined in different ways, depending on the specific application. The algorithm in this embodiment provides only one of many possible approaches. The results indicate that a very dense set of SNPs is required to capture all the common haplotype information. Once in hand, however, this information can be used to identify much smaller subsets of SNPs useful for comprehensive whole-genome association studies.

Those skilled in the art will appreciate readily that the techniques applied to human chromosome 21 can be applied to all the chromosomes present in the human genome. In a preferred embodiment of the present invention, multiple whole genomes of a diverse population representative of the human species are used to identify SNP haplotype blocks common to all or most members of the species. In some embodiments, SNP haplotype blocks are based on ancient SNPs by excluding SNPs that are represented at low frequency. The ancient SNPs are likely to be important as they have been preserved in the genome because they impart some selective benefit to organisms carrying them.

Example 6 Using Associated Genes for Gene Therapy and Drug Discovery

One example of using the methods of the present invention is outlined in this prophetic example. SNP discovery is performed on twenty haploid genomes, and fifty haploid genomes are analyzed by the methods of the present invention to determine SNP haplotype blocks, SNP haplotype patterns, informative SNPs and minor allele frequency for each informative SNP. These fifty haploid genomes comprise the control genomes of the present study (see step 1300 of FIG. 13).

Next, genomic DNA from 500 individuals having an obesity phenotype is assayed for variants by using long distance PCR and microarrays as described supra (see also, U.S. Pat. No. 6,300,063 issued to Lipshutz, et al., and U.S. Pat. No. 5,837,832 to Chee, et al.), and the frequency of the minor allele for each informative SNP is determined for this clinical population (see step 1310 of FIG. 13). The minor allele frequencies of the informative SNPs for the two populations are compared, and the control and clinical populations are determined to have statistically significant differences in three informative SNP locations (steps 1320 and 1330). The SNP location with the largest difference in the minor allele frequency between the control and clinical populations is selected for analysis.
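The comparison of minor allele frequencies between the control and clinical populations can be illustrated with a simple test on allele counts. The specification does not name the statistical test used; the sketch below assumes a Pearson chi-square test on a 2×2 table of allele counts with one degree of freedom, without continuity or multiple-testing correction. The function name and the counts are hypothetical.

```python
import math


def allele_chi_square(case_minor, case_total, control_minor, control_total):
    """Pearson chi-square statistic and p-value (1 d.f.) for a 2x2 allele table.

    Totals are numbers of chromosomes, not individuals.
    """
    table = [
        [case_minor, case_total - case_minor],
        [control_minor, control_total - control_minor],
    ]
    row_totals = [case_total, control_total]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    grand_total = case_total + control_total
    if 0 in col_totals or 0 in row_totals:
        return 0.0, 1.0  # the SNP is monomorphic or a group is empty
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / grand_total
            chi2 += (table[r][c] - expected) ** 2 / expected
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 500 diploid cases (1000 chromosomes), 50 haploid controls.
toy_chi2, toy_p = allele_chi_square(400, 1000, 10, 50)
```

With many informative SNPs tested genome-wide, a real study would need to adjust the significance threshold for the number of comparisons; that step is outside the scope of this sketch.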

The informative location selected is contained within a SNP haplotype block that is found to span 1 kb of noncoding sequence 5′ of the coding region and 4 kb of the coding region of the leptin gene (step 1340). Analysis of the variations contained within this region indicates that a G at one SNP position in this region is responsible for destruction of the promoter for the leptin gene, with a commensurate lack of expression of the leptin protein.

Fibroblasts are obtained from a subject by skin biopsy. The resulting tissue is placed in tissue-culture medium and separated into small pieces. Small pieces of the tissue are placed on the bottom of a wet surface of a tissue culture flask with medium. After 24 hours at room temperature, fresh media is added (e.g., Ham's F12 media, with 10% FBS, penicillin and streptomycin). The tissue is then incubated at 37° C. for approximately one week. At this time, fresh media is added and subsequently changed every several days. After an additional two weeks in culture, a monolayer of fibroblasts emerges. The monolayer is trypsinized and scaled into larger flasks.

The vector derived from the Moloney murine leukemia virus, which contains a kanamycin resistance gene, is digested with restriction enzymes for cloning a fragment to be expressed. The digested vector is treated with calf intestinal phosphatase to prevent self-ligation. The dephosphorylated, linear vector is fractionated on an agarose gel and purified. Leptin cDNA, capable of expressing active leptin protein product, is isolated. The ends of the fragment are modified, if necessary, for cloning into the vector. Equal molar quantities of the Moloney murine leukemia virus linear backbone and the leptin gene fragment are mixed together and joined using T4 DNA ligase. The ligation mixture is used to transform E. coli and the bacteria are then plated onto agar containing kanamycin. Kanamycin phenotype and restriction analysis confirm that the vector has the properly inserted leptin gene.

Packaging cells are grown in tissue culture to confluent density in Dulbecco's Modified Eagle's Medium (DMEM) with 10% calf serum, penicillin and streptomycin. The vector containing the leptin gene is introduced into the packaging cells by standard technique. Fresh media is added to the packaging cells, and after an appropriate incubation period, media is harvested from the plates of confluent packaging cells. The media, containing the infectious viral particles, is filtered through a Millipore filter to remove detached packaging cells, then is used to infect fibroblast cells. Media is removed from a sub-confluent plate of fibroblasts and quickly replaced with the filtered media. Polybrene (Aldrich) may be included in the media to facilitate transduction. After appropriate incubation, the media is removed and replaced with fresh media. If the titer of virus is high, then virtually all fibroblasts will be infected and no selection is required. If the titer is low, then it is necessary to use a retroviral vector that has a selectable marker, such as neo or his, to select out transduced cells for expansion.

Engineered fibroblasts then are introduced into individuals, either alone or after having been grown to confluence on microcarrier beads, such as cytodex 3 beads. The injected fibroblasts produce leptin product, and the biological actions of the protein are conveyed to the host.

Alternatively or in addition, the leptin gene is isolated, cloned into an expression vector and employed for producing leptin polypeptides. The expression vector contains suitable transcriptional and translational initiation regions, and transcriptional and translational termination regions, as disclosed supra. Isolated leptin protein can be produced in this manner and used to identify agents which bind it; alternatively, cells expressing the engineered leptin gene and protein are used in assays to identify agents. Such agents are identified by, for example, contacting a candidate agent with an isolated leptin polypeptide for a time sufficient to form a polypeptide/compound complex, and detecting the complex. If a polypeptide/compound complex is detected, the compound that binds to the leptin polypeptide is identified. Agents identified via this method can include compounds that modulate activity of leptin. Agents screened in this manner are peptides, carbohydrates, vitamin derivatives, and other small molecules or pharmaceutical agents. In addition to biological assays to identify agents, agents may be pre-screened by choosing candidate agents selected by using protein modeling techniques, based on the configuration of the leptin protein.

In addition to identifying agents that bind the leptin protein, sequence-specific or element-specific agents that control gene expression through binding to the leptin gene are also identified. One class of nucleic acid binding agents are agents that contain base residues that hybridize to leptin mRNA to block translation (e.g., antisense oligonucleotides). Another class of nucleic acid binding agents are those that form a triple helix with DNA to block transcription (triplex oligonucleotides). Such agents usually contain 20 to 40 bases, are based on the classic phosphodiester, ribonucleic acid backbone, or can be a variety of sulfhydryl or polymeric derivatives that have base attachment capacity.

Additionally, allele-specific oligonucleotides that hybridize specifically to the leptin gene and/or agents that bind specifically to the variant leptin protein (e.g., a variant-specific antibody) can be used as diagnostic agents. Methods for preparing and using allele-specific oligonucleotides and for preparing antibodies are described supra and are known in the art.

All patents and publications mentioned in this specification are indicative of the levels of those skilled in the art to which the invention pertains. All patents and publications are herein incorporated by reference to the same extent as if each individual publication was specifically and individually indicated to be incorporated by reference.

The present invention provides greatly improved methods for conducting genome-wide association studies by identifying individual variations, determining SNP haplotype blocks, determining haplotype patterns and, further, using the SNP haplotype patterns to identify informative SNPs. The informative SNPs may be used to dissect the genetic bases of disease and drug response in a practical and cost effective manner unknown previously. It is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments will be apparent to those skilled in the art upon reviewing the above description. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

1. A method for identifying pharmacogenomic-related genetic loci, comprising: a) determining SNP haplotype patterns comprising a set of SNP alleles from a plurality of individuals in a first population, wherein said individuals in said first population react in a particular manner to administration of a substance; b) determining SNP haplotype patterns comprising a set of SNP alleles from a plurality of individuals in a second population, wherein said individuals in said second population do not react in said particular manner to administration of said substance; and c) comparing frequencies of said SNP haplotype patterns from said individuals in said first population with frequencies from said SNP haplotype patterns of said individuals in said second population, wherein genomic locations of SNP haplotype patterns that exhibit a difference in said frequencies are pharmacogenomic-related genetic loci.
2. The method of claim 1, wherein said identifying is performed without a priori knowledge of a sequence or location of said pharmacogenomic-related genetic loci.
3. The method of claim 1, wherein said SNP haplotype patterns are determined in at least 10 individuals in at least one of said first population or said second population.
4. The method of claim 1, wherein said SNP haplotype patterns recited in steps a) and b) are determined using informative SNPs.
5. The method of claim 1, wherein in step a) and step b) said determining of SNP haplotype patterns is performed using pooled genomic DNA samples from individuals of said first population and said second population, respectively.
6. The method of claim 1, further comprising using said pharmacogenomic-related genetic loci as diagnostic markers for a given condition or phenotypic trait.
7. The method of claim 6, wherein said diagnostic markers are included in a kit for diagnosis of a disease, disease susceptibility, or treatment response.
8. The method of claim 7, wherein said kit further comprises means for detecting a presence or absence of said pharmacogenomic-related genetic loci in a sample of genomic DNA from a patient.
9. The method of claim 1 further comprising building a baseline of SNP haplotype patterns, wherein said building a baseline of SNP haplotype patterns comprises: identifying genetic variations in a plurality of individuals; identifying at least some of said genetic variations in individuals that occur with at least some other of said genetic variations; grouping said some of said variations in individuals that occur with said some other of said genetic variations into SNP haplotype blocks; identifying SNP haplotype patterns in said SNP haplotype blocks; and adding said SNP haplotype patterns in said SNP haplotype blocks to said baseline of SNP haplotype patterns, thereby building said baseline of SNP haplotype patterns, wherein said baseline of SNP haplotype patterns comprises said SNP haplotype patterns from said individuals in said second population, and wherein said baseline of SNP haplotype patterns further comprises said SNP haplotype patterns from said individuals in said first population.
10. The method of claim 1 further comprising building a baseline of SNP haplotype patterns, wherein said building a baseline of SNP haplotype patterns comprises: determining a sequence of an organism; scanning additional individuals of said organism for variants from said sequence; identifying some of said variants that occur with others of said variants in a first SNP haplotype block; identifying some of said variants that occur with others of said variants in a second SNP haplotype block; identifying SNP haplotype patterns in said first SNP haplotype block and said second SNP haplotype block; and adding said SNP haplotype patterns in said first SNP haplotype block and said second SNP haplotype block to said baseline of SNP haplotype patterns, thereby building said baseline of SNP haplotype patterns, wherein said baseline of SNP haplotype patterns comprises said SNP haplotype patterns from said individuals in said second population, and wherein said baseline of SNP haplotype patterns further comprises said SNP haplotype patterns from said individuals in said first population.
11. A method for identifying pharmacogenomic-related loci without a priori knowledge of a sequence or location of said pharmacogenomic-related loci, comprising: identifying genetic variations in a plurality of individuals; identifying at least some of said genetic variations that occur with at least some others of said genetic variations; genotyping a subset of said at least some of said genetic variations that occur with at least some others of said genetic variations in both a case population and a control population to generate a data set of genotypes, wherein said case population consists of individuals who exhibit a particular response to a treatment and said control population consists of individuals who do not exhibit said particular response; based on said data set of genotypes, computing a genotype frequency in said case population and a genotype frequency in said control population for each of said subset; and identifying as pharmacogenomic-related loci a set of genetic variants for which said genotype frequency in said case population is different than said genotype frequency in said control population.
 12. The method of claim 11, wherein said genetic variants are SNPs.
 13-16. (canceled)
 17. The method of claim 11, further comprising using said pharmacogenomic-related loci for a purpose selected from the group consisting of: testing of a candidate agent; stratification of a population for clinical studies; development of a gene therapy; and use as a drug discovery target.
 18. A method for identifying pharmacogenomic-related loci without a priori knowledge of a sequence or location of said pharmacogenomic-related loci, comprising: determining a sequence of an organism; scanning additional individuals of said organism for genetic variants from said sequence; identifying a first subset of said genetic variants that occur with others of said genetic variants in a first group; identifying a second subset of said genetic variants that occur with others of said genetic variants in a second group; and using some, but not all, of said genetic variants in said first and second groups in an association study to identify which of said genetic variants is correlated with a phenotypic state, wherein said phenotypic state is a response to a pharmaceutical treatment, and further wherein genetic variants that are correlated with said phenotypic state are pharmacogenomic-related loci.
 19. The method of claim 18, wherein said genetic variants are SNPs.
 20-21. (canceled)
 22. The method of claim 18, wherein said sequence of an organism comprises at least 10,000 bases of genomic DNA of said organism.
 23. (canceled)
 24. The method of claim 18, further comprising using said pharmacogenomic-related loci for a purpose selected from the group consisting of: testing of a candidate agent; stratification of a population for clinical studies; development of a gene therapy; and use as a drug discovery target.
 25. A method for determining pharmacogenomic-related genetic loci without a priori knowledge of a sequence or location of said pharmacogenomic-related genetic loci, comprising: a) determining control genotypes, wherein said control genotypes are genotypes from at least 16 individuals in a control population for a set of genomic loci; b) determining case genotypes, wherein said case genotypes are genotypes from individuals that react in an altered manner to administration of a substance for said set of genomic loci; and c) comparing frequencies of said control genotypes with frequencies of said case genotypes, wherein members of said set of genomic loci that exhibit differences in said frequencies indicate locations of pharmacogenomic-related genetic loci.
 26. The method of claim 25, further comprising using said pharmacogenomic-related genetic loci for a purpose selected from the group consisting of: testing of a candidate agent; stratification of a population for clinical studies; development of a gene therapy; and use as a drug discovery target.
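The case/control frequency comparison recited in claims 11 and 25 can be illustrated with a minimal sketch. It assumes each locus is represented by a list of genotype strings (for example "AA", "AG", "GG"), one per individual. The function names, the locus identifiers, the toy data, and the fixed difference threshold are all hypothetical and form no part of the claims, which require only that the frequencies differ; a practical study would likely replace the threshold with a statistical test. The toy populations are also far smaller than the at least 16 control individuals recited in claim 25.

```python
from collections import Counter


def genotype_frequencies(genotypes):
    """Return {genotype: fraction} for one locus in one population."""
    counts = Counter(genotypes)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


def differing_loci(case, control, threshold=0.2):
    """Return loci whose genotype frequencies differ between populations.

    case, control: dict mapping a locus id to a list of genotype strings.
    threshold: hypothetical cutoff, chosen for illustration only.
    A locus is reported when the largest per-genotype frequency
    difference between case and control exceeds the threshold.
    """
    hits = []
    for locus in sorted(set(case) & set(control)):
        f_case = genotype_frequencies(case[locus])
        f_ctrl = genotype_frequencies(control[locus])
        all_genotypes = set(f_case) | set(f_ctrl)
        diff = max(abs(f_case.get(g, 0.0) - f_ctrl.get(g, 0.0))
                   for g in all_genotypes)
        if diff > threshold:
            hits.append(locus)
    return hits


# Toy data, invented for illustration (not taken from the source).
case = {"snp1": ["AA", "AA", "AG", "AA"],
        "snp2": ["CC", "CT", "CT", "TT"]}
control = {"snp1": ["GG", "AG", "GG", "GG"],
           "snp2": ["CC", "CT", "CT", "TT"]}

print(differing_loci(case, control))  # ['snp1']
```

In this toy data only snp1 is reported, because its genotype distribution differs between the two populations while that of snp2 is identical in both.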